[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-2102:
---------------------------------
    Description: 
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55], which is based on 
the CommonCrawlDataDumper (CCDD), this exporter runs as a MapReduce job, so it 
should cope with large segments in a timely fashion and is not limited to the 
local file system.

Later on we could add a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also, WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both the modified CCDD and this 
class providing similar functionality.

This class is called in the following way:

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc \
    -dir /data/nutch-dipe/1kcrawl/segments/

  was:
This patch adds a WARC exporter 
[http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike the 
code submitted in [https://github.com/apache/nutch/pull/55], which is based on 
the CommonCrawlDataDumper (CCDD), this exporter runs as a MapReduce job, so it 
should cope with large segments in a timely fashion and is not limited to the 
local file system.

Later on we could add a WARCImporter to generate segments from WARC files, 
which is outside the scope of the CCDD anyway. Also, WARC is not specific to 
CommonCrawl, which is why the package name does not reflect it.

I don't think it would be a problem to have both 
[https://github.com/apache/nutch/pull/55] and this class providing similar 
functionality.

This class is called in the following way:

./nutch org.apache.nutch.tools.warc.WARCExporter /data/nutch-dipe/1kcrawl/warc \
    -dir /data/nutch-dipe/1kcrawl/segments/


> WARC Exporter
> -------------
>
>                 Key: NUTCH-2102
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2102
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, dumpers
>    Affects Versions: 1.10
>            Reporter: Julien Nioche
>         Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55], which is 
> based on the CommonCrawlDataDumper (CCDD), this exporter runs as a MapReduce 
> job, so it should cope with large segments in a timely fashion and is not 
> limited to the local file system.
>
> Later on we could add a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also, WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
>
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionality.
>
> This class is called in the following way:
>
> ./nutch org.apache.nutch.tools.warc.WARCExporter \
>     /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
