Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "CommonCrawlDataDumper" page has been changed by GiuseppeTotaro:
https://wiki.apache.org/nutch/CommonCrawlDataDumper

New page:

The CommonCrawlDataDumper is a Nutch tool that dumps Nutch segments into the [[http://commoncrawl.org/the-data/get-started/|CommonCrawl]] data format.

https://issues.apache.org/jira/browse/NUTCH-1949

Currently, the CommonCrawlDataDumper tool performs the following steps:

 1. deserialize the crawled data from Nutch
 2. map the deserialized data onto the proper JSON structure
 3. serialize the data into CBOR format
 4. optionally, compress the serialized data using gzip

The tool accepts either a single Nutch segment or a directory containing segments as input data.

== CBOR ==

[[http://cbor.io/|CBOR]] (Concise Binary Object Representation, RFC 7049) provides an object encoding format for serialization purposes. CBOR encoding is simple: every data item starts with an initial byte that carries its type, and when a value is small enough, the value itself is stored in that same byte. As a result, the encoding is compact in contrast to most other encodings.
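To illustrate the point about the first byte, and steps 3 and 4 above, here is a minimal Python sketch. It is not the tool's code: it is a hand-written subset of RFC 7049 covering only unsigned integers, text strings and maps, and the function names (`encode_head`, `encode_uint`, `encode_text`, `encode_item`) and the sample record fields are hypothetical, chosen for illustration only.

```python
import gzip
import struct


def encode_head(major_type: int, value: int) -> bytes:
    """Build a CBOR head: 3 bits of major type, 5 bits of additional info.

    Values below 24 fit directly in the initial byte; larger values
    follow in 1, 2, 4 or 8 big-endian bytes.
    """
    if value < 0:
        raise ValueError("only non-negative values are handled in this sketch")
    prefix = major_type << 5
    if value < 24:
        return bytes([prefix | value])
    if value < 2**8:
        return bytes([prefix | 24, value])
    if value < 2**16:
        return bytes([prefix | 25]) + struct.pack(">H", value)
    if value < 2**32:
        return bytes([prefix | 26]) + struct.pack(">I", value)
    if value < 2**64:
        return bytes([prefix | 27]) + struct.pack(">Q", value)
    raise ValueError("value too large for a CBOR head")


def encode_uint(n: int) -> bytes:
    """Major type 0: unsigned integer."""
    return encode_head(0, n)


def encode_text(s: str) -> bytes:
    """Major type 3: UTF-8 text string, prefixed by its byte length."""
    data = s.encode("utf-8")
    return encode_head(3, len(data)) + data


def encode_item(obj) -> bytes:
    """Encode the small subset of types this sketch supports."""
    if isinstance(obj, bool):
        raise TypeError("booleans are outside this sketch")
    if isinstance(obj, int):
        return encode_uint(obj)
    if isinstance(obj, str):
        return encode_text(obj)
    if isinstance(obj, dict):
        # Major type 5: map, prefixed by the number of key/value pairs.
        body = b"".join(encode_item(k) + encode_item(v) for k, v in obj.items())
        return encode_head(5, len(obj)) + body
    raise TypeError(f"unsupported type: {type(obj).__name__}")


# Small values live in the initial byte itself.
small = encode_uint(10)    # b"\x0a": one byte in total
larger = encode_uint(500)  # b"\x19\x01\xf4": initial byte plus two more

# Hypothetical record; the real field names of the JSON structure may differ.
record = {"url": "http://example.org/", "status": 200}
cbor_bytes = encode_item(record)           # step 3: serialize to CBOR
gz_bytes = gzip.compress(cbor_bytes)       # step 4: optional gzip compression
```

In this sketch the integer 10 takes a single byte, while 500 needs an initial byte followed by a two-byte value. A real dump would rely on a complete CBOR library rather than a hand-written encoder like this one.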

