[ 
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134083#comment-15134083
 ] 

Tim Allison commented on TIKA-1854:
-----------------------------------

Will commit shortly.  Thank you for the patch and test case!

First, out of curiosity (and ignorance on my part), how will you make use of 
the storage class IDs?  What do they actually mean?


bq. By the way, the Content-Type of the embedded document IS already available, 
but this only works for some popular formats (e.g. embedded MS Office 
documents).

I'm not sure exactly what you mean by "is already available."  Do you mean that 
an MS Office document often stores a metadata item alongside an attachment that 
identifies the attachment's MIME type, and that we therefore shouldn't bother 
running Tika's detection code on the embedded file?

bq. Is there a way for clients to configure the Content-Type detection for more 
exotic formats?

Do you mean generally for Tika, or specifically within MS Office docs based on 
the internally stored metadata around an attachment?  If generally for Tika, 
see e.g. [our documentation|https://tika.apache.org/1.11/parser_guide.html] or 
[this Stack Overflow question|http://stackoverflow.com/questions/30895761/how-to-add-new-mime-type-to-apache-tika].
If you mean specifically within MS Office docs, it would be great if you could 
submit another patch for the more exotic formats; but no, I don't think there 
is currently a way to configure that within MS Office docs. :)
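
For anyone following along, one way a client might use a storage class ID is as a key into its own lookup table of content types, for formats that detection misses. The sketch below is only an illustration under stated assumptions: the class name ClassIdLookup, the method contentTypeFor, and the table itself are hypothetical and not part of Tika or of the attached patch. The two class IDs shown are the commonly documented ones for Word 97-2003 and Excel 97-2003 documents, and should be verified against your own files.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical client-side helper; not part of Tika or the patch.
final class ClassIdLookup {

    // Illustrative table only. Extend it with the class IDs of the
    // "exotic" formats you care about after verifying them yourself.
    private static final Map<String, String> CONTENT_TYPES = Map.of(
            "00020906-0000-0000-C000-000000000046", "application/msword",
            "00020820-0000-0000-C000-000000000046", "application/vnd.ms-excel");

    // Strip optional surrounding braces and upper-case the hex digits,
    // so "{0002...}" and "0002..." compare equal.
    static String normalize(String classId) {
        String s = classId.trim();
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        return s.toUpperCase(Locale.ROOT);
    }

    // Returns the mapped content type, or empty if the ID is null or unknown.
    static Optional<String> contentTypeFor(String classId) {
        if (classId == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(CONTENT_TYPES.get(normalize(classId)));
    }

    public static void main(String[] args) {
        String id = "{00020906-0000-0000-c000-000000000046}";
        // Fall back to a generic type when the class ID is not in the table.
        System.out.println(contentTypeFor(id).orElse("application/octet-stream"));
    }
}
```

A client could call contentTypeFor with whatever class ID value it reads from the embedded document's metadata, and fall back to normal detection when the result is empty.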



> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1854
>                 URL: https://issues.apache.org/jira/browse/TIKA-1854
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Tim Allison
>         Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the 
> storage class ID of the embedded document would be a useful metadata to have, 
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
