Giuseppe Totaro created NUTCH-1997:
--------------------------------------

             Summary: Add CBOR "magic header" to CommonCrawlDataDumper output
                 Key: NUTCH-1997
                 URL: https://issues.apache.org/jira/browse/NUTCH-1997
             Project: Nutch
          Issue Type: Bug
          Components: tool
            Reporter: Giuseppe Totaro


For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
wraps a single string value, representing the JSON text, into CBOR. 
For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
the first byte of every file is "0x7A": the first three bits are "011", that is 
major type 3 (text string), and the following 5 bits are "11010" (26), meaning 
that a uint32_t encodes the length of the following text. The next 4 bytes 
(that big-endian uint32_t) encode the correct length of the text (as described 
in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no CBOR tag is 
currently included in the file (a list of CBOR tags is available 
[here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
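The byte layout described above can be checked with a short script. This is only a sketch in Python, not code from Nutch: the function name {{parse_cbor_head}} and the sample payload are made up for illustration, and it handles only the definite-length forms from RFC 7049.

```python
import struct

def parse_cbor_head(data: bytes):
    """Decode the initial byte and the length argument of the first CBOR item.

    Returns (major_type, additional_info, length, header_size).
    Hypothetical helper for illustration; indefinite lengths are not handled.
    """
    initial = data[0]
    major_type = initial >> 5      # top 3 bits, e.g. 0b011 = 3 (text string)
    additional = initial & 0x1F    # low 5 bits, e.g. 0b11010 = 26
    if additional < 24:
        # The length is stored directly in the initial byte.
        return major_type, additional, additional, 1
    # 24, 25, 26, 27 mean a 1-, 2-, 4- or 8-byte big-endian unsigned integer follows.
    formats = {24: ">B", 25: ">H", 26: ">I", 27: ">Q"}
    if additional not in formats:
        raise ValueError("unsupported additional info: %d" % additional)
    fmt = formats[additional]
    size = struct.calcsize(fmt)
    (length,) = struct.unpack(fmt, data[1:1 + size])
    return major_type, additional, length, 1 + size

# Made-up payload encoded the way the issue describes: 0x7A + uint32 length + text.
payload = b'{"url": "http://example.org/"}'
sample = bytes([0x7A]) + struct.pack(">I", len(payload)) + payload
major, info, length, header_size = parse_cbor_head(sample)
print(major, info, length == len(payload), header_size)
```

A major type of 3 with additional info 26 corresponds to the "0x7A" initial byte. A file that started with a tag would report major type 6 instead.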
In order to add support for CBOR detection using Apache Tika (as described in 
[TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be great 
if the {{CommonCrawlDataDumper}} tool could add the self-describing CBOR 
"magic header" ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) 
to its CBOR-encoded output files. 
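As a rough illustration of the requested change, the self-describe tag 55799 is encoded as the three bytes 0xD9 0xD9 0xF7 placed in front of the existing data item (RFC 7049, section 2.4.5). The sketch below is in Python rather than the Java used by Nutch, and {{add_magic_header}} is a hypothetical name, not an existing Nutch or Tika function.

```python
# Tag 55799 (self-describe CBOR): 0xD9 = major type 6 (tag) with a 2-byte
# argument, followed by 0xD9F7 = 55799.
SELF_DESCRIBE_TAG = bytes.fromhex("d9d9f7")

def add_magic_header(cbor_bytes: bytes) -> bytes:
    """Prepend the self-describe tag unless the data already starts with it.

    Hypothetical helper for illustration only.
    """
    if cbor_bytes.startswith(SELF_DESCRIBE_TAG):
        return cbor_bytes
    return SELF_DESCRIBE_TAG + cbor_bytes

# Made-up example: a text string encoded as 0x7A + uint32 length + UTF-8 text.
text = '{"url": "http://example.org/"}'.encode("utf-8")
untagged = bytes([0x7A]) + len(text).to_bytes(4, "big") + text
tagged = add_magic_header(untagged)
print(tagged[:8].hex())
```

Because the tag only wraps the existing item, a decoder that understands tags should still recover the same string, while a detector can match on the fixed three-byte prefix.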
Thanks a lot to [~Lukeliush] for this great research, and to [~chrismattmann] 
for supporting me on this work.



