[
https://issues.apache.org/jira/browse/TIKA-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093869#comment-13093869
]
Jeremy Anderson commented on TIKA-489:
--------------------------------------
Thanks for your prompt follow-up. I did some more testing this morning and
think I was able to narrow down the issue further. Currently there appear to
be issues with embedded PDFs and Outlook .msg files contained in MS Word
documents. I'll attach a sample of each, along with my recursive parser (in
case the problem lies in there).
From what I see, when these embedded objects are parsed, they're initially
identified as vnd.openxmlformats-officedocument.oleObject in the metadata's
Content-Type field. After the RecursiveParser's call to its superclass parse
method, the Content-Types are updated to the following:
PDFs: application/vnd.ms-works
.MSG: application/x-tika-msoffice
I believe this may be the root cause of no text being extracted for these
types of embedded files.
Out of curiosity, are there any plans for future releases of Tika to handle the
parsing of embedded documents in a manner similar to the way attachments
are currently parsed from Outlook .msg files? It's pretty nice that, out of the
box, the .msg parser automatically appends any text found in attachments to the
bottom of the original message text. A similar feature for embedded documents
may be a useful improvement.
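To illustrate the behaviour I have in mind, here is a minimal sketch in plain
Java. It does not use Tika's API and is not the attached parser: ParsedDoc and
EmbeddedTextFlattener are hypothetical names made up for the example, and the
file names and text strings are placeholders. It only shows the idea of
appending the text of each embedded document, depth-first, to the bottom of
its container's text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an already-parsed document; NOT a Tika class.
class ParsedDoc {
    final String name;              // placeholder file name
    final String text;              // text extracted from this document alone
    final List<ParsedDoc> embedded; // documents embedded inside this one

    ParsedDoc(String name, String text, ParsedDoc... embedded) {
        this.name = name;
        this.text = text;
        this.embedded = new ArrayList<ParsedDoc>(Arrays.asList(embedded));
    }
}

// Hypothetical helper, assumed for illustration only.
class EmbeddedTextFlattener {
    // Returns the container's own text followed by the text of every
    // embedded document, visited depth-first in their stored order.
    static String flatten(ParsedDoc doc) {
        StringBuilder out = new StringBuilder(doc.text);
        for (ParsedDoc child : doc.embedded) {
            out.append('\n').append(flatten(child));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ParsedDoc pdf = new ParsedDoc("sample.pdf", "PDF text");
        ParsedDoc attachment = new ParsedDoc("notes.txt", "attachment text");
        ParsedDoc msg = new ParsedDoc("sample.msg", "MSG body", attachment);
        ParsedDoc word = new ParsedDoc("sample.doc", "Word body", pdf, msg);
        System.out.println(flatten(word));
    }
}
```

In a real parser the embedded text would more likely be streamed to the same
content handler as the container's text rather than built up as a string; the
sketch just shows the ordering I would expect in the output.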
Thanks in advance for your attention to the issue.
> Embedded Documents within documents
> -----------------------------------
>
> Key: TIKA-489
> URL: https://issues.apache.org/jira/browse/TIKA-489
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.7
> Environment: All
> Reporter: Manish
> Priority: Trivial
> Fix For: 0.9
>
> Attachments: doc1.doc
>
>
> If there are embedded documents (objects) within Word files, those are not
> getting parsed.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira