[jira] [Commented] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

Tim Allison (JIRA) Fri, 05 Feb 2016 06:30:20 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134252#comment-15134252 ]


Tim Allison commented on TIKA-1854:
-----------------------------------

Got it.  This is very helpful.  Thank you.

bq.  Is the same mechanism used to determine the mime type of the embedded 
documents?

Yes, sometimes it is.  Depending on the container mime type and the embedded 
doc type, sometimes we rely on what the container document tells us the 
embedded file is, and sometimes we run the same mime type detection algorithms 
on the embedded file bytes as we do on the container files.

bq. If custom mime types worked for embedded documents that could also be 
useful.

They do.  Or at least they should... let us know if you find that they're not 
working!

bq. I think the specific formats I'm interested in are not in widespread use

Ok, yes, this makes sense.  Given our goal of being the Babel fish, though, I 
wonder whether we'd want to add whatever formats you're working on?

On a related note, would there be any utility in adding a detector that checks 
for the storageClassId in the Metadata object and then returns a mime-type 
based on a configurable lookup list?
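
To make the idea concrete, here is a minimal standalone sketch of such a 
lookup.  It is not existing Tika code: the class name, the "storageClassId" 
key and the plain Map standing in for the Metadata object are all assumptions 
for illustration.  A real version would presumably implement 
org.apache.tika.detect.Detector and read the lookup list from configuration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps a storage class ID found in metadata to a mime
// type via a configurable lookup table. A plain Map stands in for Tika's
// Metadata object so the example has no external dependencies.
final class StorageClassIdLookup {

    // Assumed metadata key; the key actually used by the patch may differ.
    static final String STORAGE_CLASS_ID_KEY = "storageClassId";

    // Returned when the class ID is missing or not in the lookup list.
    static final String FALLBACK = "application/octet-stream";

    private final Map<String, String> classIdToMime = new HashMap<>();

    // The configured map goes from class ID (braces optional, any case)
    // to mime type string.
    StorageClassIdLookup(Map<String, String> configured) {
        for (Map.Entry<String, String> e : configured.entrySet()) {
            classIdToMime.put(normalize(e.getKey()), e.getValue());
        }
    }

    // Normalizes a class ID so that "{abc-...}" and "ABC-..." compare equal.
    static String normalize(String classId) {
        String s = classId.trim().toUpperCase(Locale.ROOT);
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        return s;
    }

    // Looks up the mime type for the class ID in the metadata, if any.
    String detect(Map<String, String> metadata) {
        String classId = metadata.get(STORAGE_CLASS_ID_KEY);
        if (classId == null) {
            return FALLBACK;
        }
        return classIdToMime.getOrDefault(normalize(classId), FALLBACK);
    }
}
```

With a made-up entry mapping "{11111111-2222-3333-4444-555555555555}" to 
"application/x-example-format", metadata carrying that class ID (with or 
without braces, in either case) resolves to the custom type, and anything 
else falls through to application/octet-stream so that other detectors can 
still have their say.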

> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1854
>                 URL: https://issues.apache.org/jira/browse/TIKA-1854
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Tim Allison
>         Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the 
> storage class ID of the embedded document would be useful metadata to have, 
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
