[
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann resolved NUTCH-1997.
--------------------------------------
Resolution: Fixed
Thanks [~gostep] and [~Lukeliush]!
{noformat}
[chipotle:~/tmp/nutch-1.10-trunk] mattmann% svn commit -m "NUTCH-1997: Fix for
Add CBOR magic header to CommonCrawlDataDumper output contributed by Giuseppe
Totaro, and Luke Sh."
Sending CHANGES.txt
Sending src/java/org/apache/nutch/tools/CommonCrawlDataDumper.java
Transmitting file data ..
Committed revision 1676029.
[chipotle:~/tmp/nutch-1.10-trunk] mattmann%
{noformat}
> Add CBOR "magic header" to CommonCrawlDataDumper output
> -------------------------------------------------------
>
> Key: NUTCH-1997
> URL: https://issues.apache.org/jira/browse/NUTCH-1997
> Project: Nutch
> Issue Type: Improvement
> Components: tool
> Reporter: Giuseppe Totaro
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.10
>
> Attachments: NUTCH-1997.patch
>
>
> For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}}
> wraps a single string value, representing the JSON text, into CBOR.
> For instance, using the Unix {{hexdump}} tool, we can see that, as expected,
> the first byte of all files is "0x7A" (the first three bits are "011", that
> is the major type for text strings, and the following 5 bits are "11010",
> meaning a uint32_t encodes the length of the following text), and the
> following 4 bytes (an unsigned 32-bit integer) encode the correct length of
> the text (as described in [RFC7049|http://tools.ietf.org/html/rfc7049]).
> Therefore, no CBOR tag is currently included in the file (a list of CBOR
> tags is available
> [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
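As an illustration of the layout described above, here is a minimal Python sketch that splits the initial byte into major type and additional information and reads the 4-byte length that follows. The function name and the sample payload are hypothetical and are not taken from the Nutch code base; the sketch only handles the one case discussed here (additional information 26, i.e. a 32-bit length).

```python
import struct


def parse_text_string_header(data: bytes):
    """Decode a CBOR head whose length is stored in the next 4 bytes.

    Returns (major_type, additional_info, length).
    Hypothetical helper for illustration only.
    """
    initial = data[0]
    major_type = initial >> 5         # top 3 bits, e.g. 0b011 = text string
    additional_info = initial & 0x1F  # low 5 bits, e.g. 0b11010 = 26
    if additional_info != 26:
        raise ValueError("this sketch only handles a 32-bit length (info 26)")
    # The length is an unsigned 32-bit integer in network (big-endian) order.
    (length,) = struct.unpack(">I", data[1:5])
    return major_type, additional_info, length


# Made-up sample: a JSON text wrapped as a single CBOR text string.
payload = b'{"url": "http://example.com/"}'
sample = bytes([0b01111010]) + struct.pack(">I", len(payload)) + payload

major, info, length = parse_text_string_header(sample)
print(hex(sample[0]), major, info, length)
```

The initial byte 0b01111010 is 0x7A: major type 3 (text string) combined with additional information 26.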
> In order to add support for CBOR detection using Apache Tika (as described in
> [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be
> great if the {{CommonCrawlDataDumper}} tool were able to add the
> self-describing CBOR "magic header"
> ([Tag 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to
> CBOR-encoded output files.
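To make the request concrete, here is a rough Python sketch of what prepending the self-describing tag amounts to at the byte level. It is not the committed Java patch; the helper names are hypothetical, and the text string is deliberately encoded with a 4-byte length to mirror the layout described in this issue. Tag 55799 is encoded as major type 6 with a 16-bit tag number, which gives the three bytes 0xD9 0xD9 0xF7.

```python
import struct

SELF_DESCRIBE_TAG = 55799  # "self-describe CBOR" tag from RFC 7049, section 2.4.5


def encode_tag_head(tag: int) -> bytes:
    """Encode a tag head with a 16-bit tag number (major type 6, info 25)."""
    return bytes([(6 << 5) | 25]) + struct.pack(">H", tag)


def encode_text_string(text: str) -> bytes:
    """Encode text as a CBOR text string with a 32-bit length (major type 3, info 26)."""
    raw = text.encode("utf-8")
    return bytes([(3 << 5) | 26]) + struct.pack(">I", len(raw)) + raw


def with_magic_header(cbor_item: bytes) -> bytes:
    """Prefix an already encoded CBOR item with the self-describing tag."""
    return encode_tag_head(SELF_DESCRIBE_TAG) + cbor_item


# Made-up JSON text standing in for one dumped record.
json_text = '{"url": "http://example.com/"}'
output = with_magic_header(encode_text_string(json_text))
print(output[:3].hex(), hex(output[3]))
```

A detector can then look for the fixed three-byte prefix at offset 0 before attempting to parse the rest of the file as CBOR.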
> Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann]
> for supporting me on this work.