Julien Nioche created NUTCH-2102:
------------------------------------
Summary: WARC Exporter
Key: NUTCH-2102
URL: https://issues.apache.org/jira/browse/NUTCH-2102
Project: Nutch
Issue Type: Improvement
Components: commoncrawl, dumpers
Affects Versions: 1.10
Reporter: Julien Nioche
This patch adds a WARC exporter (see the WARC specification:
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]). Unlike the
code submitted in [https://github.com/apache/nutch/pull/55], which is based on
the CommonCrawlDataDumper, this exporter is a MapReduce job, so it should be
able to cope with large segments in a timely fashion and is not limited to
the local file system.
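
For readers unfamiliar with the format, here is a rough sketch of what a single
WARC/1.0 "response" record looks like when serialized. This is not the code
from the patch and it is not a MapReduce job; the class name WarcRecordSketch
and the method responseRecord are hypothetical, and the only assumption is that
the raw HTTP response bytes are available for each fetched URL. In a
MapReduce-based exporter, something like this would presumably be called once
per record before writing to the output.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

// Hypothetical helper, for illustration only (not part of Nutch).
public class WarcRecordSketch {
    private static final String CRLF = "\r\n";

    /**
     * Serializes one WARC/1.0 "response" record:
     * header fields, an empty line, the content block, then two CRLFs.
     */
    public static byte[] responseRecord(String targetUri, Instant fetchTime,
                                        byte[] httpResponse) {
        String header = "WARC/1.0" + CRLF
            + "WARC-Type: response" + CRLF
            + "WARC-Target-URI: " + targetUri + CRLF
            // WARC-Date is a UTC timestamp such as 2015-09-01T12:00:00Z
            + "WARC-Date: " + fetchTime.truncatedTo(ChronoUnit.SECONDS) + CRLF
            + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">" + CRLF
            + "Content-Type: application/http; msgtype=response" + CRLF
            // Content-Length counts the bytes of the content block only
            + "Content-Length: " + httpResponse.length + CRLF
            + CRLF;

        byte[] headerBytes = header.getBytes(StandardCharsets.UTF_8);
        byte[] trailer = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headerBytes, 0, headerBytes.length);
        out.write(httpResponse, 0, httpResponse.length);
        out.write(trailer, 0, trailer.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] http = "HTTP/1.1 200 OK\r\n\r\nhi"
            .getBytes(StandardCharsets.ISO_8859_1);
        byte[] record = responseRecord("http://example.com/", Instant.now(), http);
        System.out.print(new String(record, StandardCharsets.ISO_8859_1));
    }
}
```

Because each record carries its own Content-Length and ends with a fixed
trailer, records can be written independently and concatenated, which is what
makes the format a reasonable fit for parallel output.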
Later on we could add a WARCImporter to generate segments from WARC files,
which is outside the scope of the CommonCrawlDataDumper (CCDD) anyway. Also,
WARC is not specific to CommonCrawl, which is why the package name does not
mention it.
I don't think it would be a problem to have both
[https://github.com/apache/nutch/pull/55] and this class providing similar
functionality.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)