[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747300#comment-14747300
 ] 

Julien Nioche commented on NUTCH-2102:
--------------------------------------

The only modification to existing code is in the class 
'src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java', 
where we added two new configuration properties:
* store.http.request
* store.http.headers
which are used to keep the HTTP request and the response headers verbatim in the 
content metadata. Both are set to false by default.
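
The change can be sketched roughly as follows. This is a minimal, self-contained illustration of the conditional capture described above, assuming a plain map for the content metadata; the property names come from this comment, while the class name, method signature, and metadata keys ("_request_", "_response.headers_") are illustrative and not necessarily Nutch's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: stands in for the logic added to HttpResponse.java.
public class SimpleResponse {
    static final String STORE_REQUEST = "store.http.request";
    static final String STORE_HEADERS = "store.http.headers";

    // Stand-in for the Nutch content metadata.
    final Map<String, String> metadata = new LinkedHashMap<>();

    void capture(Map<String, Boolean> conf, String rawRequest, String rawHeaders) {
        // Both properties default to false, as stated above.
        if (conf.getOrDefault(STORE_REQUEST, false)) {
            metadata.put("_request_", rawRequest);              // verbatim HTTP request
        }
        if (conf.getOrDefault(STORE_HEADERS, false)) {
            metadata.put("_response.headers_", rawHeaders);     // verbatim response headers
        }
    }
}
```

With both properties left unset, nothing extra is stored, so existing crawls are unaffected.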

Note that this is also used by [https://github.com/apache/nutch/pull/55].


> WARC Exporter
> -------------
>
>                 Key: NUTCH-2102
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2102
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, dumpers
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>         Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both 
> [https://github.com/apache/nutch/pull/55] and this class providing similar 
> functionalities.
> This class is called as follows: 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)