Jorge Luis Betancourt Gonzalez created NUTCH-2095:
-----------------------------------------------------
Summary: WARC exporter for the CommonCrawlDataDumper
Key: NUTCH-2095
URL: https://issues.apache.org/jira/browse/NUTCH-2095
Project: Nutch
Issue Type: Improvement
Components: commoncrawl, tool
Affects Versions: 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
Adds the ability to export Nutch segments to WARC files.
From a usage point of view, a couple of new command-line options are
available:
{{-warc}}: enables exporting to WARC files; if not specified, the default
JACKSON formatter is used.
{{-warcSize}}: sets the maximum file size for each WARC file; if not
specified, a default of 1GB per file is used, as recommended by the WARC
ISO standard.
The usual {{-gzip}} flag can be used to enable compression on the WARC files.
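The size cap described above can be pictured with a minimal sketch. This is not the code from the patch: the class {{RollingFileNamer}}, its method names, the file naming pattern and the exact byte value used for "1GB" are all assumptions made for illustration. It only shows the general idea of rolling over to a new output file once the current one would exceed the configured maximum.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of Nutch): picks the output file for each
 * record so that no file grows past a configured maximum size.
 */
class RollingFileNamer {
    // Assumed interpretation of the "1GB" default; the patch may use another value.
    static final long DEFAULT_MAX_BYTES = 1_000_000_000L;

    private final String prefix;
    private final long maxBytes;
    private final boolean gzip;
    private int index = 0;            // sequence number of the current file
    private long bytesInCurrent = 0;  // bytes already assigned to the current file

    RollingFileNamer(String prefix, long maxBytes, boolean gzip) {
        this.prefix = prefix;
        this.maxBytes = maxBytes;
        this.gzip = gzip;
    }

    /** Returns the name of the file that a record of the given size goes to. */
    String fileFor(long recordBytes) {
        // Roll over only when the current file already holds data and the new
        // record would push it past the cap; an oversized single record still
        // gets a file of its own.
        if (bytesInCurrent > 0 && bytesInCurrent + recordBytes > maxBytes) {
            index++;
            bytesInCurrent = 0;
        }
        bytesInCurrent += recordBytes;
        // The naming pattern is an assumption made for this sketch.
        return String.format(Locale.ROOT, "%s-%05d.warc%s",
                prefix, index, gzip ? ".gz" : "");
    }
}
```

With a cap of 100 bytes, records of 60, 60 and 40 bytes would land in the first, second and second file respectively, because the third record fills the second file exactly without exceeding the cap.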
Some changes were made to the default {{CommonCrawlDataDumper}}, essentially
to the Factory and to the Formats. These changes avoid creating a
new instance of a {{CommonCrawlFormat}} for each URL read from the segments.
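As a rough illustration of the instance-reuse idea, the sketch below creates one format object up front and resets its per-record state for every URL instead of constructing a new object each time. The classes {{SketchFormat}} and {{SketchDumper}} are hypothetical stand-ins and do not reflect the actual Nutch classes or their signatures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a format object that is costly to construct. */
class SketchFormat {
    static int instancesCreated = 0;  // counts constructions, for demonstration only
    private final StringBuilder buffer = new StringBuilder();

    SketchFormat() {
        instancesCreated++;
    }

    /** Clears the per-record state, then renders one record as "url<TAB>length". */
    String render(String url, String content) {
        buffer.setLength(0);
        buffer.append(url).append('\t').append(content.length());
        return buffer.toString();
    }
}

/** Hypothetical dumper loop that reuses a single format instance. */
class SketchDumper {
    /** Each record is a two-element array: {url, content}. */
    static List<String> dump(List<String[]> records) {
        SketchFormat format = new SketchFormat();  // created once, outside the loop
        List<String> out = new ArrayList<>();
        for (String[] record : records) {
            out.add(format.render(record[0], record[1]));
        }
        return out;
    }
}
```

Whether reuse pays off depends on how expensive the format object is to build; the sketch assumes construction is the dominant cost and that resetting state between records is cheap.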
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)