[ 
https://issues.apache.org/jira/browse/TIKA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke sh updated TIKA-1610:
--------------------------
    Attachment: 1424402690000.html

cbor file dumped by the nutch tool.

> CBOR Parser and detection improvement
> -------------------------------------
>
>                 Key: TIKA-1610
>                 URL: https://issues.apache.org/jira/browse/TIKA-1610
>             Project: Tika
>          Issue Type: New Feature
>          Components: detector, mime, parser
>    Affects Versions: 1.7
>            Reporter: Luke sh
>            Priority: Trivial
>              Labels: memex
>         Attachments: 1424402690000.html, cbor_tika.mimetypes.xml.jpg, 
> rfc_cbor.jpg
>
>
> CBOR is a data format whose design goals include the possibility of extremely 
> small code size, fairly small message size, and extensibility without the 
> need for version negotiation (cited from http://cbor.io/ ).
> It would be great if Tika is able to provide the support with CBOR parser and 
> identification. In the current project with Nutch, the Nutch 
> CommonCrawlDataDumper is used to dump the crawled segments to the files in 
> the format of CBOR. In order to read/parse those dumped files by this tool, 
> it would be great if tika is able to support parsing the cbor, the thing is 
> that the CommonCrawlDataDumper is not dumping with correct extension, it 
> dumps with its own rule, the default extension of the dumped file is html, so 
> it might be less painful if tika is able to detect and parse those files 
> without any pre-processing steps. 
> CommonCrawlDataDumper is calling the following to dump with cbor.
> import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
> import com.fasterxml.jackson.dataformat.cbor.CBORGenerator;
> fasterxml is a 3rd party library for converting json to .cbor and Vice Versa.
> According to RFC 7049 (http://tools.ietf.org/html/rfc7049), it looks like 
> CBOR does not yet have its magic numbers to be detected/identified by other 
> applications (PFA: rfc_cbor.jpg)
> It seems that the only way to inform other applications of the type as of now 
> is using the extension (i.e. .cbor), or probably content detection (i.e. byte 
> histogram distribution estimation).  
> There is another thing worth the attention, it looks like tika has attempted 
> to add the support with cbor mime detection in the tika-mimetypes.xml 
> (PFA:cbor_tika.mimetypes.xml.jpg); This detection is not working with the 
> cbor file dumped by CommonCrawlDataDumper. 
> According to http://tools.ietf.org/html/rfc7049, there is a self-describing 
> Tag 55799 that seems to be used for cbor type identification, but it is 
> probably up to the application that take care of this tag, and it is also 
> possible that the fasterxml is missing this tag, an example cbor file dumped 
> by the Nutch tool i.e. CommonCrawlDataDumper has also been attached (PFA: 
> 1424402690000.html)
> On the other hand, it is worth noting that the entries for cbor extension 
> detection needs to be appended in the tika-mimetypes.xml too 
> <glob pattern="*.cbor"/>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to