[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
---------------------------------
    Attachment: NUTCH-2102.patch

> WARC Exporter
> -------------
>
>                 Key: NUTCH-2102
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2102
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, dumpers
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>         Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike
> the code submitted in [https://github.com/apache/nutch/pull/55], which is
> based on the CommonCrawlDataDumper (CCDD), this exporter is a MapReduce job.
> It should therefore cope with large segments in a timely fashion, and it is
> not limited to the local file system.
> Later on we could add a WARCImporter to generate segments from WARC files,
> which is outside the scope of the CCDD anyway. Also, WARC is not specific to
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both
> [https://github.com/apache/nutch/pull/55] and this class providing similar
> functionality.
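
For readers unfamiliar with the format, here is a minimal sketch of how a single WARC/1.0 "response" record is laid out on disk. It is not taken from the attached patch (which is Java and runs as a MapReduce job). The function name `warc_response_record` and its parameters are hypothetical, and only a small subset of the header fields defined in the WARC specification is written.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def warc_response_record(target_uri, http_payload, fetched_at=None, record_id=None):
    """Serialize one WARC/1.0 'response' record as bytes.

    Hypothetical helper for illustration only, not the exporter's code.
    http_payload is the raw HTTP response (status line, headers, body).
    fetched_at must be a timezone-aware datetime; it is written in UTC.
    """
    fetched_at = (fetched_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record_id = record_id or uuid.uuid4()

    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", fetched_at.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % record_id),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts only the bytes of the record block.
        ("Content-Length", str(len(http_payload))),
    ]

    # Version line, then one "Name: value" line per header field.
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += ("%s: %s" % (name, value)).encode("utf-8") + CRLF

    # A blank line separates header and block; two CRLFs close the record.
    return head + CRLF + http_payload + CRLF + CRLF
```

An exporter would emit one such record per fetched URL and concatenate them into an output file, typically compressing each record separately.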



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
