Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "CommonCrawlDataDumper" page has been changed by GiuseppeTotaro:
https://wiki.apache.org/nutch/CommonCrawlDataDumper

New page:
The CommonCrawlDataDumper is a Nutch tool that dumps Nutch segments into the 
[[http://commoncrawl.org/the-data/get-started/|CommonCrawl]] data format.

https://issues.apache.org/jira/browse/NUTCH-1949

Currently, the CommonCrawlDataDumper tool is able to perform the following steps:
 1. deserialize the crawled data from Nutch
 2. map the deserialized data onto the proper JSON structure
 3. serialize the data into CBOR format
 4. optionally, compress the serialized data using gzip
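
As an illustration of step 4 only, the following is a minimal sketch of optional gzip compression using the standard `java.util.zip` classes. The class name `GzipStepSketch`, the method names, and the `compress` flag are hypothetical and are not taken from the CommonCrawlDataDumper source; the input byte array simply stands in for data already serialized in step 3.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipStepSketch {

    // Returns the data unchanged, or gzip-compressed when 'compress' is true.
    // The 'compress' flag is a hypothetical stand-in for the tool's optional gzip step.
    static byte[] maybeCompress(byte[] serialized, boolean compress) {
        return compress ? gzip(serialized) : serialized;
    }

    // Compresses a byte array with gzip, entirely in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(data, 0, data.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    // Decompresses gzip data; used here only to check the round trip.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for the output of the serialization step.
        byte[] serialized = {0x61, 0x61, 0x0a, 0x19, 0x03, (byte) 0xe8};
        byte[] compressed = maybeCompress(serialized, true);
        byte[] restored = gunzip(compressed);
        System.out.println(java.util.Arrays.equals(serialized, restored));
    }
}
```

For very small inputs such as the placeholder above, the gzip header and trailer make the output larger than the input, which is one reason a compression step like this is typically left optional.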

This tool is able to work with either a single Nutch segment or a directory 
containing segments as input data.

== CBOR ==

[[http://cbor.io/|CBOR]] (Concise Binary Object Representation, RFC 7049) 
provides an object encoding format for serialization purposes. CBOR encoding is 
simple: each data item begins with an initial byte that carries its type, and 
when a value is small enough, the value itself is also stored in that first 
byte. As a result, the encoding is compact in contrast to most other encodings.
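
To make the role of the first byte concrete, here is a minimal hand-written sketch of how RFC 7049 encodes unsigned integers and text strings. It is not the encoder used by the CommonCrawlDataDumper; the class and method names are hypothetical, and only two of the CBOR major types are covered.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CborHeadSketch {
    // Major types defined by RFC 7049 (only two are handled in this sketch).
    static final int MAJOR_UNSIGNED = 0;
    static final int MAJOR_TEXT = 3;

    // Writes the initial byte: 3 bits of major type, 5 bits of additional information.
    // Values below 24 fit in the initial byte; larger ones follow in 1, 2, 4 or 8 bytes.
    static void writeHead(ByteArrayOutputStream out, int major, long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        int mt = major << 5;
        if (value < 24) {
            out.write(mt | (int) value);
        } else if (value <= 0xFFL) {
            out.write(mt | 24);
            out.write((int) value);
        } else if (value <= 0xFFFFL) {
            out.write(mt | 25);
            out.write((int) (value >> 8));
            out.write((int) value);
        } else if (value <= 0xFFFFFFFFL) {
            out.write(mt | 26);
            for (int shift = 24; shift >= 0; shift -= 8) {
                out.write((int) (value >> shift));
            }
        } else {
            out.write(mt | 27);
            for (int shift = 56; shift >= 0; shift -= 8) {
                out.write((int) (value >> shift));
            }
        }
    }

    // Encodes a non-negative integer as a CBOR unsigned integer.
    static byte[] encodeUnsigned(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, MAJOR_UNSIGNED, value);
        return out.toByteArray();
    }

    // Encodes a string as a CBOR text string: head with the UTF-8 length, then the bytes.
    static byte[] encodeText(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, MAJOR_TEXT, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Formats bytes as lowercase hex for display.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeUnsigned(10)));   // 0a
        System.out.println(toHex(encodeUnsigned(1000))); // 1903e8
        System.out.println(toHex(encodeText("a")));      // 6161
    }
}
```

The integer 10 occupies a single byte because it fits in the five low bits of the initial byte, while 1000 needs the initial byte plus two more. The same rule applies to the length of the string "a", so its whole encoding is two bytes.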
