Giuseppe Totaro created NUTCH-1949:
--------------------------------------

             Summary: Dump out the Nutch data into the Common Crawl format
                 Key: NUTCH-1949
                 URL: https://issues.apache.org/jira/browse/NUTCH-1949
             Project: Nutch
          Issue Type: New Feature
            Reporter: Giuseppe Totaro


We are going to develop a {{CommonCrawlDataDumper.java}} class. 
{{CommonCrawlDataDumper}} is a tool that performs the following steps:
# deserialize the crawled data from Nutch
# map the deserialized data onto the proper JSON structure
# serialize the data into [CBOR|http://cbor.io] format
# optionally, compress the serialized data using {{gzip}}
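
As a rough illustration of the last two steps only, the sketch below encodes a flat record as a CBOR map of text strings and optionally gzips the result. It is not the planned {{CommonCrawlDataDumper}} code: the class name {{CborDumpSketch}}, the record field names and the hand-written encoder are assumptions made for this example, and a real implementation would more likely rely on an existing CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the actual CommonCrawlDataDumper implementation.
class CborDumpSketch {

    // Writes a CBOR item header: 3-bit major type plus a length/count argument.
    static void writeHeader(ByteArrayOutputStream out, int majorType, int n) {
        int mt = majorType << 5;
        if (n < 24) {
            out.write(mt | n);            // argument fits in the initial byte
        } else if (n < 0x100) {
            out.write(mt | 24);           // 1-byte argument follows
            out.write(n);
        } else if (n < 0x10000) {
            out.write(mt | 25);           // 2-byte big-endian argument follows
            out.write(n >> 8);
            out.write(n);
        } else {
            out.write(mt | 26);           // 4-byte big-endian argument follows
            out.write(n >> 24);
            out.write(n >> 16);
            out.write(n >> 8);
            out.write(n);
        }
    }

    // CBOR major type 3: UTF-8 text string.
    static void writeText(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeHeader(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // CBOR major type 5: map, here restricted to text keys and text values.
    static byte[] toCbor(Map<String, String> record) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHeader(out, 5, record.size());
        for (Map.Entry<String, String> e : record.entrySet()) {
            writeText(out, e.getKey());
            writeText(out, e.getValue());
        }
        return out.toByteArray();
    }

    // Optional last step: gzip the serialized bytes.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Field names below are placeholders, not the Common Crawl schema.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.org/");
        record.put("content", "<html>hello</html>");
        byte[] cbor = toCbor(record);
        byte[] packed = gzip(cbor);
        System.out.println(cbor.length + " CBOR bytes, " + packed.length + " gzipped bytes");
    }
}
```

Restricting the sketch to text keys and values keeps the encoder short; nested JSON objects, arrays and binary content would need the remaining CBOR major types.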

This tool has to be able to work with either a single Nutch segment or a 
directory containing segments as input data.
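
One possible way to accept both kinds of input is sketched below. The class name {{SegmentInputSketch}} is hypothetical, and the check assumes that a segment directory can be recognised by a {{content}} subdirectory, which holds only when the fetcher stored the raw content; the real tool may use a different rule.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of input handling, not the actual tool code.
class SegmentInputSketch {

    // Assumption: a directory is treated as a segment if it has a "content" subdirectory.
    static boolean isSegment(Path dir) {
        return Files.isDirectory(dir.resolve("content"));
    }

    // Returns the input itself if it is a segment, otherwise its direct child segments.
    static List<Path> resolveSegments(Path input) {
        if (isSegment(input)) {
            return Collections.singletonList(input);
        }
        try (Stream<Path> children = Files.list(input)) {
            return children
                    .filter(SegmentInputSketch::isSegment)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, the rest of the dumper can iterate over the returned list without caring which kind of path was passed in.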

Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and 
code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
