[jira] [Commented] (TIKA-489) Embedded Documents within documents

Jeremy Anderson (JIRA) Tue, 30 Aug 2011 09:28:01 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093869#comment-13093869
 ]


Jeremy Anderson commented on TIKA-489:
--------------------------------------

Thanks for your prompt follow-up.  I did some more testing this morning and 
think I was able to narrow down the issue further.  Currently there appear to 
be issues with embedded PDFs and Outlook .msg files contained in MS Word 
documents.  I'll attach a sample of each, along with my recursive parser (in 
case the problem lies there).

From what I see, when these embedded objects are parsed, they're initially 
identified as vnd.openxmlformats-officedocument.oleObject in the metadata's 
Content-Type field.  After the recursive parser calls its superclass's parse 
method, the Content-Type values change to the following:

PDFs:   application/vnd.ms-works
.MSG:   application/x-tika-msoffice

I believe this may be the reason no text is being extracted for these types 
of embedded files.

Out of curiosity, are there any plans for future releases of Tika to handle 
the parsing of embedded documents in a manner similar to the way attachments 
are currently parsed from Outlook .msg files?  It's pretty nice that, out of 
the box, the .msg parser automatically appends any text found in attachments 
to the bottom of the original message text.  A similar feature for embedded 
documents may be a useful improvement.
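
As a rough illustration of the requested behavior, a minimal sketch in plain 
Java (not Tika code) might look like the following.  ParsedDoc and 
EmbeddedTextAppender are hypothetical names invented for this example; they 
only model the idea of appending the text of each embedded document after the 
text of its container, and say nothing about how Tika implements it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed document; this is NOT a Tika class.
class ParsedDoc {
    final String name;                                 // label only, e.g. a file name
    final String text;                                 // text extracted from this document
    final List<ParsedDoc> embedded = new ArrayList<>(); // documents embedded inside it

    ParsedDoc(String name, String text) {
        this.name = name;
        this.text = text;
    }

    // Adds an embedded document and returns this document to allow chaining.
    ParsedDoc embed(ParsedDoc child) {
        embedded.add(child);
        return this;
    }
}

class EmbeddedTextAppender {
    // Returns the document's own text followed by the text of every
    // embedded document, visited depth-first in insertion order.
    static String flatten(ParsedDoc doc) {
        StringBuilder out = new StringBuilder(doc.text);
        for (ParsedDoc child : doc.embedded) {
            out.append('\n').append(flatten(child));
        }
        return out.toString();
    }
}
```

Under these assumptions, a Word file containing an embedded PDF and an 
embedded .msg would yield the Word text first, then the PDF text, then the 
message text and the text of any attachments on that message.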

Thanks in advance for your attention to the issue.

> Embedded Documents within documents
> -----------------------------------
>
>                 Key: TIKA-489
>                 URL: https://issues.apache.org/jira/browse/TIKA-489
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>         Environment: All
>            Reporter: Manish
>            Priority: Trivial
>             Fix For: 0.9
>
>         Attachments: doc1.doc
>
>
> If there are embedded documents (objects) within Word files, those are not 
> getting parsed. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
