[ 
https://issues.apache.org/jira/browse/NUTCH-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giuseppe Totaro updated NUTCH-1997:
-----------------------------------
    Attachment: NUTCH-1997.patch

[~Lukeliush], the attached patch writes the CBOR tag at the 
beginning of the document.
Unfortunately, the {{WRITE_TYPE_HEADER}} feature (which enables that tag) in 
[jackson-dataformat-cbor|https://github.com/FasterXML/jackson-dataformat-cbor] 
is not yet supported. It will be supported starting with version 2.5.
Therefore, I implemented a very simple method ({{writeMagicHeader}}) that 
writes the serialized CBOR tag ({{0xd9d9f7}}) directly to the document, as 
described in [RFC 7049|http://tools.ietf.org/html/rfc7049#section-2.4.5].
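
For illustration, here is a minimal sketch of the idea, not the code from the attached patch. Only the name {{writeMagicHeader}} and the bytes {{0xd9d9f7}} come from the text above; the class name, {{prependMagicHeader}}, and {{startsWithMagicHeader}} are hypothetical helpers assumed for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch class; not part of Nutch or of the attached patch.
class CborMagicHeaderSketch {

    // Serialized CBOR tag 55799 (RFC 7049, section 2.4.5):
    // 0xd9 = major type 6 (tag) with a 2-byte tag number following,
    // 0xd9f7 = 55799.
    static final byte[] SELF_DESCRIBE_TAG = {(byte) 0xd9, (byte) 0xd9, (byte) 0xf7};

    // Writes the three tag bytes to the stream before any CBOR content.
    static void writeMagicHeader(OutputStream out) throws IOException {
        out.write(SELF_DESCRIBE_TAG);
    }

    // Returns a new array holding the tag bytes followed by the CBOR body.
    static byte[] prependMagicHeader(byte[] cborBody) {
        byte[] result = new byte[SELF_DESCRIBE_TAG.length + cborBody.length];
        System.arraycopy(SELF_DESCRIBE_TAG, 0, result, 0, SELF_DESCRIBE_TAG.length);
        System.arraycopy(cborBody, 0, result, SELF_DESCRIBE_TAG.length, cborBody.length);
        return result;
    }

    // Checks whether the data begins with the self-describing CBOR tag.
    static boolean startsWithMagicHeader(byte[] data) {
        if (data.length < SELF_DESCRIBE_TAG.length) {
            return false;
        }
        for (int i = 0; i < SELF_DESCRIBE_TAG.length; i++) {
            if (data[i] != SELF_DESCRIBE_TAG[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // 0x61 0x61 is the CBOR encoding of the one-character text string "a".
        byte[] body = {0x61, 0x61};

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeMagicHeader(out);
        out.write(body);

        StringBuilder hex = new StringBuilder();
        for (byte b : out.toByteArray()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        System.out.println(hex);
        System.out.println(startsWithMagicHeader(out.toByteArray()));
    }
}
```

A detector only needs to compare the first three bytes of a file against the tag, which is what the {{startsWithMagicHeader}} helper in this sketch does.
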
[~Lukeliush], could you please test this patch and check whether Tika is able 
to detect the file correctly?
I will upload another patch 
([NUTCH-1998|https://issues.apache.org/jira/browse/NUTCH-1998]) to add support 
for a user-defined extension. 
Thanks a lot.

> Add CBOR "magic header" to CommonCrawlDataDumper output
> -------------------------------------------------------
>
>                 Key: NUTCH-1997
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1997
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>         Attachments: NUTCH-1997.patch
>
>
> For each file extracted from Nutch crawled data, {{CommonCrawlDataDumper}} 
> wraps a single string value, representing the JSON text, into CBOR. 
> For instance, using the Unix {{hexdump}} tool, we can see that, as expected, 
> the first byte of all files is "0x7A" (the first three bits are "011", which 
> is the major type for text strings, and the following 5 bits are "11010", 
> meaning a uint32_t encodes the length of the following text), and the 
> following 4 bytes (uint32_t) encode the correct length of the text (as 
> described in [RFC7049|http://tools.ietf.org/html/rfc7049]). Therefore, no 
> CBOR tag is currently included in the file (a list of CBOR tags is available 
> [here|https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml]).
> In order to add support for CBOR detection using Apache Tika (as described in 
> [TIKA-1610|https://issues.apache.org/jira/browse/TIKA-1610]), it would be 
> great if the {{CommonCrawlDataDumper}} tool could add the self-describing 
> CBOR "magic header" ([Tag 
> 55799|http://tools.ietf.org/html/rfc7049#section-2.4.5]) to CBOR-encoded 
> output files. 
> Thanks a lot [~Lukeliush] for this great research. Thanks [~chrismattmann] 
> for supporting me on this work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
