[
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752209#comment-13752209
]
Daniel Bonniot de Ruisselet edited comment on TIKA-1167 at 8/28/13 11:47 AM:
-----------------------------------------------------------------------------
After further analysis, I think support for such cases probably needs to be
done in POI (but comments are welcome if someone has further insight). I posted
comments and a tentative patch to this POI bug:
https://issues.apache.org/bugzilla/show_bug.cgi?id=51891
Even if that works out well, it would probably be useful to add a test at the
Tika level as well. The OLE parsing seems rather sensitive (for good reason:
the format itself looks messy and poorly documented). Also, the integration of
POI and Tika seems tight. So it can only help to test that things work at
different levels.
was (Author: dbr):
After further analysis, I think support for such cases probably needs to be
done in POI (but comments are welcome if someone has further insight). I'm
working on submitting an issue, and probably a tentative patch, there. Will
link to it here when it exists.
Even if that works out well, it would probably be useful to add a test at the
Tika level as well. The OLE parsing seems rather sensitive (for good reason:
the format itself looks messy and poorly documented). Also, the integration of
POI and Tika seems tight. So it can only help to test that things work at
different levels.
> Embedded object not extracted
> -----------------------------
>
> Key: TIKA-1167
> URL: https://issues.apache.org/jira/browse/TIKA-1167
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
> Attachments: Doc w Structure that wont extract.docx
>
>
> For the attached docx, Tika seems to detect the embedded object, as shown by
> this tag:
> {{<div class="embedded" id="rId10"/>}}
> However, extraction itself (using -z on the command line, or using the API)
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to
> /tmp/tika/rId9_image1.wmf}}
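
One way to check such a file independently of both Tika and POI is to look at
the .docx package directly, since a .docx is a ZIP archive. The sketch below is
only an illustration and uses nothing but the JDK. The class and method names
(DocxEmbeddingLister, listEmbeddedParts) are made up for this example, and it
assumes that embedded objects are stored under word/embeddings/, which is the
usual layout but has not been verified for the attached document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or POI.
class DocxEmbeddingLister {

    // Assumption: embedded objects live under this prefix inside the package.
    static final String EMBEDDINGS_PREFIX = "word/embeddings/";

    // Returns the names of all non-directory ZIP entries under word/embeddings/.
    static List<String> listEmbeddedParts(InputStream docx) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()
                        && entry.getName().startsWith(EMBEDDINGS_PREFIX)) {
                    parts.add(entry.getName());
                }
            }
        }
        return parts;
    }

    // Usage: java DocxEmbeddingLister <path-to-docx>
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listEmbeddedParts(in)) {
                System.out.println(name);
            }
        }
    }
}
```

If a part shows up in this listing but is missing from the -z output, that
would suggest the gap is in the parsing layer rather than in the file itself.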