[
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510402#comment-14510402
]
Luke sh commented on TIKA-1610:
-------------------------------
Thanks a lot [~gagravarr] for the prompt response.
I thought it would be probably be risky if we discard any one of the estimated
types because of the magic priority (one is higher than the other, i wanted
tika to rely on the extension when there is a tie to break.
For now, in this particular case, i also cannot think of any reason why we
don't use 60, might be i am too skeptical.
Thanks
> CBOR Parser and detection [improvement]
> ---------------------------------------
>
> Key: TIKA-1610
> URL: https://issues.apache.org/jira/browse/TIKA-1610
> Project: Tika
> Issue Type: New Feature
> Components: detector, mime, parser
> Affects Versions: 1.7
> Reporter: Luke sh
> Assignee: Chris A. Mattmann
> Priority: Trivial
> Labels: memex
> Attachments: 1424402690000.html, NUTCH-1997.cbor,
> cbor_tika.mimetypes.xml.jpg, rfc_cbor.jpg
>
>
> CBOR is a data format whose design goals include the possibility of extremely
> small code size, fairly small message size, and extensibility without the
> need for version negotiation (cited from http://cbor.io/ ).
> It would be great if Tika is able to provide the support with CBOR parser and
> identification. In the current project with Nutch, the Nutch
> CommonCrawlDataDumper is used to dump the crawled segments to the files in
> the format of CBOR. In order to read/parse those dumped files by this tool,
> it would be great if tika is able to support parsing the cbor, the thing is
> that the CommonCrawlDataDumper is not dumping with correct extension, it
> dumps with its own rule, the default extension of the dumped file is html, so
> it might be less painful if tika is able to detect and parse those files
> without any pre-processing steps.
> CommonCrawlDataDumper is calling the following to dump with cbor.
> import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
> import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
> fasterxml is a 3rd party library for converting json to .cbor and Vice Versa.
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like
> CBOR does not yet have its magic numbers to be detected/identified by other
> applications (PFA: rfc_cbor.jpg)
> It seems that the only way to inform other applications of the type as of now
> is using the extension (i.e. .cbor), or probably content detection (i.e. byte
> histogram distribution estimation).
> There is another thing worth the attention, it looks like tika has attempted
> to add the support with cbor mime detection in the tika-mimetypes.xml
> (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the
> cbor file dumped by CommonCrawlDataDumper.
> According to http://tools.ietf.org/html/rfc7049#section-2.4.5, there is a
> self-describing Tag 55799 that seems to be used for cbor type
> identification(the hex code might be 0xd9d9f7), but it is probably up to the
> application that take care of this tag, and it is also possible that the
> fasterxml that the nutch dumping tool is missing this tag, an example cbor
> file dumped by the Nutch tool i.e. CommonCrawlDataDumper has also been
> attached (PFA: 1424402690000.html).
> The following info is cited from the rfc, "...a decoder might be able to
> parse both CBOR and JSON.
> Such a decoder would need to mechanically distinguish the two
> formats. An easy way for an encoder to help the decoder would be to
> tag the entire CBOR item with tag 55799, the serialization of which
> will never be found at the beginning of a JSON text..."
> It looks like the a file can have two parts/sections i.e. the plain text
> parts and the json prettified by cbor, this might be also worth the attention
> and consideration with the parsing and type identification.
> On the other hand, it is worth noting that the entries for cbor extension
> detection needs to be appended in the tika-mimetypes.xml too
> e.g.
> <glob pattern="*.cbor"/>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)