Nick Burch created TIKA-1648:
--------------------------------
Summary: Investigate Word .doc WMF/EMF/PICT attachments
Key: TIKA-1648
URL: https://issues.apache.org/jira/browse/TIKA-1648
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.9
Reporter: Nick Burch
As spotted when working on TIKA-1644, many of the govdocs1 Word .doc files have
embedded image resources which are coming through as WMF, EMF or PICT. In at
least some of the cases, these files don't have the typical header that would
be expected for that format, but do have a PDF header some tens or a few
hundred bytes into the file. (Some of the files do come out correctly though,
so the problem doesn't look universal.)
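A rough way to triage an extracted resource might look like the sketch below.
It is only an illustration: HeaderProbe and its methods are hypothetical names,
not part of Tika or POI. It assumes the resource bytes are already in memory,
checks only for the placeable WMF key and the EMF header signature (PICT and
non-placeable WMF have no comparably simple magic and are not covered), and
scans the start of the data for a "%PDF-" marker.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical triage helper; not part of Tika or POI. */
final class HeaderProbe {
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" marker that lies entirely
     * within the first `limit` bytes of `data`, or -1 if there is none.
     */
    static int findPdfHeader(byte[] data, int limit) {
        int end = Math.min(data.length, limit) - PDF_MAGIC.length;
        for (int i = 0; i <= end; i++) {
            boolean match = true;
            for (int j = 0; j < PDF_MAGIC.length; j++) {
                if (data[i + j] != PDF_MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /** True if the data starts with the placeable WMF key (bytes D7 CD C6 9A). */
    static boolean looksLikePlaceableWmf(byte[] data) {
        return data.length >= 4
                && (data[0] & 0xFF) == 0xD7 && (data[1] & 0xFF) == 0xCD
                && (data[2] & 0xFF) == 0xC6 && (data[3] & 0xFF) == 0x9A;
    }

    /** True if the data has EMF record type 1 at offset 0 and " EMF" at offset 40. */
    static boolean looksLikeEmf(byte[] data) {
        return data.length >= 44
                && data[0] == 1 && data[1] == 0 && data[2] == 0 && data[3] == 0
                && data[40] == ' ' && data[41] == 'E'
                && data[42] == 'M' && data[43] == 'F';
    }
}
```

A resource for which both looksLike checks fail but findPdfHeader(data, 1024)
returns a small positive offset would match the pattern described above. The
1024-byte window is an arbitrary choice for the sketch.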
It's possible that this is all as expected and normal. However, it's possible
that something in the POI code for pulling out the embedded resources is either
truncating or failing to truncate the header, or somehow otherwise failing to
correctly pull these out. The result is that they aren't coming through as
embedded resources quite as they should.
This is probably going to mean lots of time with the file format specs, some
time creating some slightly unusual test files containing attachments in these
formats, and then finally looking at the govdocs1 ones.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)