GitHub user jorgelbg opened a pull request:
https://github.com/apache/nutch/pull/55
WARC exporter for the CommonCrawlDataDumper
This adds the possibility of exporting Nutch segments to WARC files.
From the usage point of view, a couple of new command-line options are
available:
* `-warc`: enables exporting into WARC files; if not specified, the default
JACKSON formatter is used.
* `-warcSize`: sets the maximum file size for each WARC file; if not
specified, a default of 1GB per file is used, as recommended by the
WARC ISO standard.
The usual `-gzip` flag can be used to compress the output WARC files.
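To illustrate how a size cap like `-warcSize` could drive file rotation, here is a minimal sketch. The class and method names are hypothetical and are not taken from this pull request; the exact byte count used for "1GB" is also an assumption. It only decides which output file each record would land in and does no I/O.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Nutch: assigns records to size-capped files. */
final class WarcRotationSketch {
    // Assumption: "1GB" is interpreted here as 10^9 bytes.
    static final long DEFAULT_MAX_BYTES = 1_000_000_000L;

    /** Returns, for each record size, the index of the file it would be written to. */
    static List<Integer> assignFiles(long[] recordSizes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        List<Integer> fileIndex = new ArrayList<>();
        int current = 0;   // index of the file currently being filled
        long written = 0;  // bytes already assigned to the current file
        for (long size : recordSizes) {
            // Roll over only when the current file already holds data, so a
            // single oversized record still gets a file of its own.
            if (written > 0 && written + size > maxBytes) {
                current++;
                written = 0;
            }
            written += size;
            fileIndex.add(current);
        }
        return fileIndex;
    }
}
```

With a cap of 1000 bytes, records of 400, 400, 400, 1500 and 100 bytes would be assigned to files 0, 0, 1, 2 and 3. Whether the actual exporter splits before or after exceeding the limit is not stated here.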
Some changes were made to the default CommonCrawlDataDumper, essentially
to the Factory and to the Formats. These changes avoid creating a
new instance of a CommonCrawlFormat for each URL read from the segments.
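The idea of reusing one format instance across URLs, and closing it once at the end, can be sketched roughly as follows. The `Format` interface and `CountingFormat` class are hypothetical stand-ins for illustration; they do not reproduce the actual CommonCrawlFormat signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FormatReuseSketch {
    /** Hypothetical stand-in for a format interface with a close method. */
    interface Format extends AutoCloseable {
        String getRecord(String url, byte[] content);

        @Override
        void close();
    }

    // Counts how many format instances were built, to show the reuse.
    static int instancesCreated = 0;

    static final class CountingFormat implements Format {
        private boolean closed = false;

        CountingFormat() {
            instancesCreated++;
        }

        @Override
        public String getRecord(String url, byte[] content) {
            if (closed) {
                throw new IllegalStateException("format already closed");
            }
            return url + " (" + content.length + " bytes)";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Creates the format once, reuses it for every URL, then closes it. */
    static List<String> dump(List<String> urls) {
        List<String> out = new ArrayList<>();
        try (Format format = new CountingFormat()) {
            for (String url : urls) {
                byte[] content = url.getBytes(StandardCharsets.UTF_8);
                out.add(format.getRecord(url, content));
            }
        }
        return out;
    }
}
```

Because the per-URL data is passed to the method rather than to the constructor, the format can be constructed before any content is read.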
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DigitalPebble/nutch warc
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/55.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #55
----
commit 0a627e5a5098a2ad4818b594fe567ea7fdd2c131
Author: Jorge Luis Betancourt <[email protected]>
Date: 2015-09-08T13:21:04Z
Initial version of the CommonCrawlWARCFormat, generates valid metadata,
response and request records. The request
records only provide partial information, roughly the same as the
CommonCrawl Data Dumper at the moment.
commit 1889a0b64d48005499f4de01ed18724087feb0f7
Author: Jorge Luis Betancourt <[email protected]>
Date: 2015-09-08T16:37:27Z
Adding the WARCUtils class and the dependency to the ivy.xml file to avoid
the fetching of another hadoop dependency
commit 169e5a4a4172424b31c91e232bb69056b10827c7
Author: Jorge Luis Betancourt <[email protected]>
Date: 2015-09-08T18:21:47Z
Removing the transitive property of the ivy.xml file to avoid any future
troubles
commit ede35d1aa767741ec5206de7990910fc661983e8
Author: Jorge Luis Betancourt <[email protected]>
Date: 2015-09-10T17:57:11Z
Doing some refactoring on the existing code, essentially trying to avoid
creating an instance of each CommonCrawlFormat
per URL processed. Since the format is content independent at the moment,
the factory should allow creating a format
without this data.
Added a close method to the CommonCrawlFormat interface for those cases
when the format needs some closing
statement.
commit 44beb74172364556f70b6f08d0a8ee511c99eff4
Author: Jorge Luis Betancourt <[email protected]>
Date: 2015-09-11T14:34:42Z
Adding the changes to the main CCDataDumper class to call the WARC exporter
tool.
Changes to the Jackson format to work with the new structure.
Changes to the FormatFactory to create the right Jackson/WARC instance.
----