[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche resolved NUTCH-2102.
----------------------------------
Resolution: Fixed
Committed revision 1704634.
Thanks for the reviews.
> WARC Exporter
> -------------
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
> Issue Type: Improvement
> Components: commoncrawl, dumpers
> Affects Versions: 1.10
> Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike
> the code submitted in [https://github.com/apache/nutch/pull/55], which is
> based on the CommonCrawlDataDumper (CCDD), this exporter is a MapReduce job.
> It should therefore be able to process large segments in a timely fashion,
> and it is not limited to the local file system.
> Later on we could add a WARCImporter to generate segments from WARC files,
> which is outside the scope of the CCDD anyway. Also, WARC is not specific to
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this
> class providing similar functionality.
> The class is invoked as follows:
> ./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/
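
For readers unfamiliar with the format, here is a rough sketch of what a single WARC/1.0 "response" record looks like on disk. It is not taken from the patch and does not describe how WARCExporter builds its output. The class name WarcRecordSketch and the method responseRecord are hypothetical, and only a minimal set of header fields is written.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

// Hypothetical helper, for illustration only; not part of Nutch.
class WarcRecordSketch {
    static final String CRLF = "\r\n";

    // Wraps a raw HTTP response (status line, headers, body) in a WARC/1.0 record.
    static byte[] responseRecord(String targetUri, Instant fetched, byte[] httpBlock) {
        String header = "WARC/1.0" + CRLF
            + "WARC-Type: response" + CRLF
            + "WARC-Target-URI: " + targetUri + CRLF
            // WARC 1.0 dates are UTC with second precision, e.g. 2015-09-22T10:00:00Z
            + "WARC-Date: " + fetched.truncatedTo(ChronoUnit.SECONDS) + CRLF
            + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">" + CRLF
            + "Content-Type: application/http; msgtype=response" + CRLF
            // Content-Length counts the bytes of the block only, not the header
            + "Content-Length: " + httpBlock.length + CRLF
            + CRLF; // blank line separates the header from the block

        byte[] head = header.getBytes(StandardCharsets.UTF_8);
        byte[] tail = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8); // record terminator

        byte[] out = new byte[head.length + httpBlock.length + tail.length];
        System.arraycopy(head, 0, out, 0, head.length);
        System.arraycopy(httpBlock, 0, out, head.length, httpBlock.length);
        System.arraycopy(tail, 0, out, head.length + httpBlock.length, tail.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] http = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
            .getBytes(StandardCharsets.UTF_8);
        byte[] record = responseRecord("http://example.com/", Instant.now(), http);
        System.out.print(new String(record, StandardCharsets.UTF_8));
    }
}
```

A WARC file is a concatenation of such records, often gzip-compressed one record at a time.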