Michael McCandless created TIKA-987:
---------------------------------------
Summary: Embedded drawing (SHAPE MERGEFORMAT) sometimes not extracted
Key: TIKA-987
URL: https://issues.apache.org/jira/browse/TIKA-987
Project: Tika
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 1.3
I have two Word docs, both containing the same drawing, but one has
text added.
In one case (picture.doc) the extraction is correct: it contains only
an embedded image.wmf; when I view the image it's correct.
In the second case (picture_3.doc) the picture is extracted as "image"
(no extension) and is 0 bytes, and an invalid character (mapped to the
Unicode replacement char) is inserted before the image:
{noformat}
<title/>
</head>
<body><p>�<img src="embedded:image1" alt="image1"/></p>
<p/>
<p/>
<p>vehicle
</p>
{noformat}
(The text "vehicle" is extracted correctly, though.)
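To make the symptom easy to spot across many documents, here is a minimal sketch of a check on the extracted XHTML. It only uses the JDK; the class and method names (ExtractionCheck, hasReplacementChar, embeddedNamesWithoutExtension) are hypothetical and not part of Tika, and it assumes the output has the same `src="embedded:NAME"` form as the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: flags the two symptoms seen in
// the XHTML produced for picture_3.doc.
class ExtractionCheck {
    // Assumes embedded resources are referenced as src="embedded:NAME".
    private static final Pattern EMBEDDED_SRC =
            Pattern.compile("src=\"embedded:([^\"]*)\"");

    /** True if the text contains U+FFFD, the Unicode replacement character. */
    static boolean hasReplacementChar(String xhtml) {
        return xhtml.indexOf('\uFFFD') >= 0;
    }

    /** Names of embedded resources that are referenced without a file extension. */
    static List<String> embeddedNamesWithoutExtension(String xhtml) {
        List<String> names = new ArrayList<>();
        Matcher m = EMBEDDED_SRC.matcher(xhtml);
        while (m.find()) {
            String name = m.group(1);
            if (name.lastIndexOf('.') < 0) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String xhtml =
                "<body><p>\uFFFD<img src=\"embedded:image1\" alt=\"image1\"/></p>";
        System.out.println(hasReplacementChar(xhtml));            // prints "true"
        System.out.println(embeddedNamesWithoutExtension(xhtml)); // prints "[image1]"
    }
}
```

A reference such as `embedded:image.wmf` would not be reported, so the output for picture.doc should pass both checks. This only inspects the XHTML; detecting the 0-byte payload would additionally need the embedded bytes themselves.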
I dug a bit: the 2nd doc contains an embedded {SHAPE * MERGEFORMAT}
field, on which we invoke WordExtractor.handleSpecialCharacterRuns,
and somehow that extracts the 0-byte no-extension image as well as
the invalid character. The first doc has no such field (at least not
one that's handled by handleSpecialCharacterRuns...). Otherwise I'm
not sure how to fix this... it could be that something is going wrong
in how POI parses the Pictures from PictureSource.