[ https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-2102.
----------------------------------
    Resolution: Fixed

Committed revision 1704634. Thanks for the reviews

> WARC Exporter
> -------------
>
>                 Key: NUTCH-2102
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2102
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, dumpers
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>         Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike
> the code submitted in [https://github.com/apache/nutch/pull/55] which is
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and
> hence should be able to cope with large segments in a timely fashion and also
> is not limited to the local file system.
>
> Later on we could have a WARCImporter to generate segments from WARC files,
> which is outside the scope of the CCDD anyway. Also WARC is not specific to
> CommonCrawl, which is why the package name does not reflect it.
>
> I don't think it would be a problem to have both the modified CCDD and this
> class providing similar functionalities.
>
> This class is called in the following way
>
> ./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)