[jira] [Comment Edited] (TIKA-1854) Include the storage class ID of documents embedded in MS Office documents

Daniel Bonniot de Ruisselet (JIRA) Fri, 05 Feb 2016 05:48:20 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134180#comment-15134180
 ]


Daniel Bonniot de Ruisselet edited comment on TIKA-1854 at 2/5/16 1:47 PM:
---------------------------------------------------------------------------

The documents I'm processing sometimes have embedded files representing 
chemical structures. Having the storage class IDs allows me to know which 
embedded documents contain chemical structures, and which format they are 
in, so that I can process them accordingly.

What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE) 
already returns, for instance, "application/vnd.ms-excel" for embedded Excel 
documents. However, it is not populated for chemical or other formats.
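
To illustrate the use case, here is a minimal sketch in plain Java with no 
Tika dependency. A Map<String, String> stands in for the embedded document's 
metadata. The key name "embeddedStorageClassId", the placeholder class ID and 
the "chemical/x-example" type are all assumptions made up for illustration; 
they are not existing Tika constants or real registered values. The idea is 
to use the content type when it is populated and to fall back to a lookup by 
storage class ID otherwise.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class EmbeddedFormatResolver {
    // Same string value as Tika's Metadata.CONTENT_TYPE.
    static final String CONTENT_TYPE = "Content-Type";
    // Hypothetical key for the storage class ID (an assumption, not a Tika constant).
    static final String STORAGE_CLASS_ID = "embeddedStorageClassId";

    private final Map<String, String> formatsByClassId = new HashMap<>();

    // Associate a storage class ID with a format name chosen by the caller.
    void register(String classId, String format) {
        formatsByClassId.put(normalize(classId), format);
    }

    // Prefer the content type when present; otherwise look up the class ID.
    // Returns null when neither identifies the embedded document.
    String resolve(Map<String, String> metadata) {
        String contentType = metadata.get(CONTENT_TYPE);
        if (contentType != null && !contentType.isEmpty()) {
            return contentType;
        }
        String classId = metadata.get(STORAGE_CLASS_ID);
        if (classId == null) {
            return null;
        }
        return formatsByClassId.get(normalize(classId));
    }

    // Accept class IDs with or without braces, in any letter case.
    private static String normalize(String classId) {
        String s = classId.trim();
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        return s.toUpperCase(Locale.ROOT);
    }
}
```

With something like this, an extractor could register the class IDs it cares 
about once and then route each embedded document to the matching processing 
step, without Tika having to know about those formats.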

I might be mistaken, but it seems to me that the documentation you linked to is 
about the MIME type of the main (container) document. Is the same mechanism 
used to determine the MIME type of the embedded documents?

I think the specific formats I'm interested in are not in widespread use, so 
for contributions to Tika I'm focusing on a generic solution. Getting the 
storage class IDs will definitely be useful in such cases. If custom MIME types 
worked for embedded documents, that could also be useful.



was (Author: dbr):
The documents I'm processing sometimes have embedded files representing 
chemical structures. Having the storage class IDs allow me to know which 
embedded chemical structures, and to know in which format they are, so that I 
can process them accordingly.

What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE) 
already returns for instance "application/vnd.ms-excel" for embedded excel 
documents. However it is not populated for chemical or other formats.

I might be mistaken, but it seems to me that the documentation you linked to is 
about the mime type of the main (container) document. Is the same mechanism 
used to determine the mime type of the embedded documents?

I think the specific formats I'm interested in are not in widespread use, so 
for contributions to Tika I'm rather focused on a generic solution. Getting the 
storage class IDs will definitely be useful in such cases. If custom mime types 
worked for embedded documents that could also be useful.


> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1854
>                 URL: https://issues.apache.org/jira/browse/TIKA-1854
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Tim Allison
>         Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the 
> storage class ID of the embedded document would be useful metadata to have, 
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
