Julien Nioche created NUTCH-2102:
------------------------------------

             Summary: WARC Exporter
                 Key: NUTCH-2102
                 URL: https://issues.apache.org/jira/browse/NUTCH-2102
             Project: Nutch
          Issue Type: Improvement
          Components: commoncrawl, dumpers
    Affects Versions: 1.10
            Reporter: Julien Nioche


This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55], which is based on 
the CommonCrawlDataDumper, this exporter is a MapReduce job. It should therefore 
cope with large segments in a timely fashion, and it is not limited to the 
local file system.
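
For readers unfamiliar with the format, here is a minimal sketch of what a 
single WARC "response" record looks like on the wire. It is not the code from 
the patch: the class name WarcRecordSketch and the method responseRecord are 
hypothetical, and the MapReduce wiring is left out. It assumes WARC/1.0 and 
only the mandatory named fields plus WARC-Target-URI.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;

// Hypothetical illustration only; not the exporter class from the patch.
public class WarcRecordSketch {

    private static final String CRLF = "\r\n";

    /**
     * Serialises one "response" record: version line, named fields,
     * an empty line, the content block, then two CRLFs.
     */
    public static byte[] responseRecord(String targetUri, Instant fetchTime,
                                        byte[] httpBlock) {
        String header = "WARC/1.0" + CRLF
            + "WARC-Type: response" + CRLF
            + "WARC-Target-URI: " + targetUri + CRLF
            // WARC-Date is a UTC timestamp, e.g. 2015-09-01T12:00:00Z
            + "WARC-Date: " + fetchTime.truncatedTo(ChronoUnit.SECONDS) + CRLF
            + "WARC-Record-ID: <urn:uuid:" + UUID.randomUUID() + ">" + CRLF
            + "Content-Type: application/http; msgtype=response" + CRLF
            // Content-Length counts the bytes of the block only
            + "Content-Length: " + httpBlock.length + CRLF
            + CRLF;

        byte[] head = header.getBytes(StandardCharsets.UTF_8);
        byte[] tail = (CRLF + CRLF).getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(httpBlock, 0, httpBlock.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }
}
```

In a MapReduce exporter, a method along these lines would presumably be called 
once per fetched URL, with the records appended to files on whatever file 
system the job writes to.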

Later on we could have a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CommonCrawlDataDumper (CCDD) anyway. Also, 
WARC is not specific to CommonCrawl, which is why the package name does not 
reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
