[
https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134180#comment-15134180
]
Daniel Bonniot de Ruisselet edited comment on TIKA-1854 at 2/5/16 1:47 PM:
---------------------------------------------------------------------------
The documents I'm processing sometimes have embedded files representing
chemical structures. Having the storage class IDs allows me to know which
embedded documents contain chemical structures, and which format they are
in, so that I can process them accordingly.
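As a rough illustration of that use case, here is a minimal sketch of a lookup
from storage class ID to format. Everything in it is an assumption made for
illustration: the class name EmbeddedFormatLookup, the method formatFor, the
class IDs and the format names are placeholders, not values defined by Tika or
by any real application.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Sketch only: the class IDs and format names below are placeholders,
// not values defined by Tika or by any real application.
class EmbeddedFormatLookup {

    // Hypothetical table mapping a storage class ID to a format name.
    private static final Map<String, String> FORMAT_BY_CLASS_ID = Map.of(
        "11111111-1111-1111-1111-111111111111", "chemical-format-a",
        "22222222-2222-2222-2222-222222222222", "chemical-format-b");

    // Normalizes a class ID (surrounding braces optional, any letter case)
    // and returns the matching format name, or empty if it is unknown.
    static Optional<String> formatFor(String classId) {
        if (classId == null) {
            return Optional.empty();
        }
        String key = classId.trim().toLowerCase(Locale.ROOT);
        if (key.startsWith("{") && key.endsWith("}")) {
            key = key.substring(1, key.length() - 1);
        }
        return Optional.ofNullable(FORMAT_BY_CLASS_ID.get(key));
    }
}
```

With the class ID available in the metadata of each embedded document, a
caller could pass it to formatFor and skip any document for which the result
is empty.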
What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE)
already returns, for instance, "application/vnd.ms-excel" for embedded Excel
documents. However, it is not populated for chemical or other formats.
I might be mistaken, but it seems to me that the documentation you linked to is
about the mime type of the main (container) document. Is the same mechanism
used to determine the mime type of the embedded documents?
I think the specific formats I'm interested in are not in widespread use, so
for contributions to Tika I'm rather focused on a generic solution. Getting the
storage class IDs will definitely be useful in such cases. If custom MIME types
worked for embedded documents, that could also be useful.
> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
> Key: TIKA-1854
> URL: https://issues.apache.org/jira/browse/TIKA-1854
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Assignee: Tim Allison
> Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the
> storage class ID of the embedded document would be useful metadata to have,
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.