Nick Burch created TIKA-1648:
--------------------------------

             Summary: Investigate Word .doc WMF/EMF/PICT attachments
                 Key: TIKA-1648
                 URL: https://issues.apache.org/jira/browse/TIKA-1648
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.9
            Reporter: Nick Burch


As spotted when working on TIKA-1644, many of the govdocs1 Word .doc files have 
embedded image resources which are coming through as WMF, EMF or PICT. In at 
least some cases, these files don't have the header that would normally be 
expected for that format, but do have a PDF header some tens or a few hundred 
bytes into the file. (Some of the files do come out correctly though, so the 
problem doesn't look universal.)
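
A rough way to triage the extracted resources is to check each one for the 
expected leading magic bytes and for a PDF header further in. The sketch below 
is only an illustration, not anything from Tika or POI: the function name, the 
returned keys and the 1024 byte scan limit are all made up for this example. 
It only recognises the placeable WMF header (0x9AC6CDD7) and the EMF header 
record (record type 1, with " EMF" at offset 40); it does not attempt PICT or 
non-placeable WMF.

```python
PDF_MAGIC = b"%PDF-"
# Placeable WMF key 0x9AC6CDD7, stored little-endian on disk
WMF_PLACEABLE_MAGIC = b"\xd7\xcd\xc6\x9a"
# EMF starts with an EMR_HEADER record (type 1) and has " EMF" at offset 40
EMF_RECORD_TYPE = b"\x01\x00\x00\x00"
EMF_SIGNATURE = b" EMF"


def describe_embedded(data: bytes, scan_limit: int = 1024) -> dict:
    """Report which known headers appear in an extracted resource.

    scan_limit is an arbitrary assumption about how far into the
    resource a displaced PDF header might plausibly sit.
    """
    wmf_header = data.startswith(WMF_PLACEABLE_MAGIC)
    emf_header = (
        len(data) >= 44
        and data[0:4] == EMF_RECORD_TYPE
        and data[40:44] == EMF_SIGNATURE
    )
    # bytes.find returns -1 when the signature is not in the scanned range
    pdf_offset = data.find(PDF_MAGIC, 0, scan_limit)
    return {
        "wmf_header": wmf_header,
        "emf_header": emf_header,
        "pdf_offset": pdf_offset,
    }


if __name__ == "__main__":
    # Synthetic stand-in for an extracted resource: 120 bytes of
    # unknown leading data followed by a PDF header.
    sample = b"\x00" * 120 + b"%PDF-1.4\n"
    print(describe_embedded(sample))
```

A resource that reports neither a WMF nor an EMF header but a pdf_offset 
greater than zero would match the pattern described above, and would be a 
candidate for closer inspection against the format specs.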

It's possible that this is all as expected and normal. However, it's also 
possible that something in the POI code for pulling out the embedded resources 
is either truncating the header when it shouldn't, failing to truncate it when 
it should, or somehow otherwise failing to pull these out correctly. The 
result is that they aren't coming through quite as they should as embedded 
resources.

This is probably going to mean lots of time with the file format specs, some 
time creating slightly unusual test files containing attachments in these 
formats, and then finally looking at the govdocs1 ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
