[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

jorgelbg Fri, 11 Sep 2015 07:49:55 -0700

GitHub user jorgelbg opened a pull request:

    https://github.com/apache/nutch/pull/55


    WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting Nutch segments to WARC files.
    
    From the usage point of view, a couple of new command-line options are available:
    
    * `-warc`: enables exporting to WARC files; if not specified, the default Jackson formatter is used.
    * `-warcSize`: sets the maximum file size for each WARC file; if not specified, a default of 1GB per file is used, as recommended by the WARC ISO standard.
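
The size cap described for `-warcSize` can be pictured with a minimal Java sketch. It is not the code from this pull request: the class `SizeCappedRoller`, its method `fileFor`, and the choice of 2^30 bytes for "1GB" are assumptions made for illustration only.

```java
/** Hypothetical helper (not part of Nutch): decides which output file a record belongs to. */
class SizeCappedRoller {
    // Assumption: "1GB" is taken as 2^30 bytes; the real exporter may use 10^9.
    static final long DEFAULT_MAX_BYTES = 1024L * 1024L * 1024L;

    private final long maxBytes;
    private long currentBytes = 0; // bytes assigned to the current file so far
    private int fileIndex = 0;     // index of the current output file

    SizeCappedRoller(long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        this.maxBytes = maxBytes;
    }

    /** Returns the index of the file that should receive a record of the given size. */
    int fileFor(long recordBytes) {
        // Start a new file only when the current one already holds data
        // and adding this record would push it past the cap.
        if (currentBytes > 0 && currentBytes + recordBytes > maxBytes) {
            fileIndex++;
            currentBytes = 0;
        }
        currentBytes += recordBytes;
        return fileIndex;
    }
}

class RollerDemo {
    public static void main(String[] args) {
        SizeCappedRoller roller = new SizeCappedRoller(100);
        long[] recordSizes = {60, 30, 20};
        for (long size : recordSizes) {
            System.out.println(size + " bytes -> file " + roller.fileFor(size));
        }
    }
}
```

In this sketch a record larger than the cap still gets a file of its own rather than being split, which is one possible policy among several.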
    
    The usual `-gzip` flag can be used to compress the output WARC files.
    
    Some changes were made to the default CommonCrawlDataDumper, essentially to the Factory and to the Formats. These changes avoid creating a new instance of a CommonCrawlFormat for each URL read from the segments.
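
The refactoring idea (create one format instance, reuse it for every URL, close it once at the end) might look roughly like the following Java sketch. The names `DumpFormat`, `RecordingFormat` and `DumpFormatFactory` are hypothetical stand-ins and do not reproduce the actual CommonCrawlFormat interface or its factory.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a content-independent output format. */
interface DumpFormat extends Closeable {
    void write(String url, byte[] content);

    @Override
    void close(); // narrowed: no checked exception in this sketch
}

/** Records what it is asked to write, so the reuse pattern is observable. */
class RecordingFormat implements DumpFormat {
    final String name;
    final List<String> lines = new ArrayList<>();
    private boolean closed = false;

    RecordingFormat(String name) {
        this.name = name;
    }

    @Override
    public void write(String url, byte[] content) {
        if (closed) {
            throw new IllegalStateException("format already closed");
        }
        lines.add(name + " " + url + " " + content.length);
    }

    @Override
    public void close() {
        // A real format could emit a trailing record or flush a file here.
        lines.add("END");
        closed = true;
    }
}

/** Hypothetical factory: builds a format without needing any per-URL data. */
class DumpFormatFactory {
    static DumpFormat create(String name) {
        return new RecordingFormat("warc".equals(name) ? "warc" : "jackson");
    }
}

class FormatReuseDemo {
    public static void main(String[] args) {
        String[] urls = {"http://example.org/a", "http://example.org/b"};
        // One instance for the whole run; try-with-resources calls close() once.
        try (DumpFormat format = DumpFormatFactory.create("warc")) {
            for (String url : urls) {
                format.write(url, new byte[0]);
            }
        }
    }
}
```

Having the format extend `Closeable` is a design choice of this sketch; it lets the caller rely on try-with-resources for the closing statement.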

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DigitalPebble/nutch warc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #55
    
----
commit 0a627e5a5098a2ad4818b594fe567ea7fdd2c131
Author: Jorge Luis Betancourt <[email protected]>
Date:   2015-09-08T13:21:04Z

    Initial version of the CommonCrawlWARCFormat, generates valid metadata, response and request records. The request
    records only provide partial information, roughly the same as the CommonCrawl Data Dumper at the moment.

commit 1889a0b64d48005499f4de01ed18724087feb0f7
Author: Jorge Luis Betancourt <[email protected]>
Date:   2015-09-08T16:37:27Z

    Adding the WARCUtils class and the dependency to the ivy.xml file to avoid the fetching of another hadoop dependency

commit 169e5a4a4172424b31c91e232bb69056b10827c7
Author: Jorge Luis Betancourt <[email protected]>
Date:   2015-09-08T18:21:47Z

    Removing the transitive property of the ivy.xml file to avoid any future troubles

commit ede35d1aa767741ec5206de7990910fc661983e8
Author: Jorge Luis Betancourt <[email protected]>
Date:   2015-09-10T17:57:11Z

    Doing some refactoring on the existing code, essentially trying to avoid creating an instance of each CommonCrawlFormat
    per URL processed. Since the format is content independent at the moment, the factory should allow creating a format
    without this data.
    
    Added a close method to the CommonCrawlFormat interface for those cases when the format needs some closing
    statement.

commit 44beb74172364556f70b6f08d0a8ee511c99eff4
Author: Jorge Luis Betancourt <[email protected]>
Date:   2015-09-11T14:34:42Z

    Adding the changes to the main CCDataDumper class to call the WARC exporter tool.
    Changes to the Jackson format to work with the new structure.
    Changes to the FormatFactory to create the right Jackson/WARC instance.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
