[ 
https://issues.apache.org/jira/browse/TIKA-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752209#comment-13752209
 ] 

Daniel Bonniot de Ruisselet edited comment on TIKA-1167 at 8/28/13 11:47 AM:
-----------------------------------------------------------------------------

After further analysis, I think support for such cases probably needs to be 
done in POI (but comments welcome if someone has further insight). I posted 
comments and a tentative patch to this POI bug: 
https://issues.apache.org/bugzilla/show_bug.cgi?id=51891

Even if that works out well, it would probably be useful to add a test at the 
Tika level as well. The OLE parsing seems rather fragile (with good reason: the 
format itself looks messy and poorly documented). Also, the integration of POI 
and Tika seems tight. So it can only help to test that things work at different 
levels.
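
As a rough illustration of checking at a level below both Tika and POI, here is 
a minimal sketch that lists the parts a docx package physically contains, using 
only java.util.zip. The class and method names are hypothetical, and it assumes 
the usual convention that embedded objects are stored under word/embeddings/ 
and images under word/media/. The authoritative mapping for an id such as 
rId10 is the relationships part (word/_rels/document.xml.rels), which this 
sketch does not parse.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika or POI.
public class DocxEmbeddings {

    // Returns the names of all file entries stored under the conventional
    // embedded-object and media folders of a docx package.
    static List<String> listEmbeddedParts(Path docx) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                boolean candidate = name.startsWith("word/embeddings/")
                        || name.startsWith("word/media/");
                if (!entry.isDirectory() && candidate) {
                    parts.add(name);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the docx file to inspect.
        for (String part : listEmbeddedParts(Path.of(args[0]))) {
            System.out.println(part);
        }
    }
}
```

Comparing this list with what the -z option extracts would show which parts 
are present in the package but skipped during extraction.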
                
      was (Author: dbr):
    After further analysis, I think support for such cases probably needs to be 
done in POI (but comments welcome if someone has further insight). I'm working 
on submitting an issue and probably a tentative patch there. Will link to it 
here when it exists.

Even if that works out well, it would probably be useful to add a test at the 
Tika level as well. The OLE parsing seems rather fragile (with good reason: the 
format itself looks messy and poorly documented). Also, the integration of POI 
and Tika seems tight. So it can only help to test that things work at different 
levels.
                  
> Embedded object not extracted
> -----------------------------
>
>                 Key: TIKA-1167
>                 URL: https://issues.apache.org/jira/browse/TIKA-1167
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Daniel Bonniot de Ruisselet
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: Doc w Structure that wont extract.docx
>
>
> For the attached docx, Tika seems to detect the embedded object, as shown by 
> this tag:
> {{<div class="embedded" id="rId10"/>}}
> However, extraction itself (using -z on the command line, or using the API) 
> does not seem to work for this object:
> {{java -jar tika-app-1.4.jar -z Doc\ w\ Structure\ that\ wont\ extract.docx}}
> {{Extracting 'rId9_image1.wmf' (application/x-msmetafile) to 
> /tmp/tika/rId9_image1.wmf}}
