Tim Allison created TIKA-2354:
---------------------------------
Summary: Missing many embedded images in .doc files
Key: TIKA-2354
URL: https://issues.apache.org/jira/browse/TIKA-2354
Project: Tika
Issue Type: Bug
Reporter: Tim Allison
Priority: Blocker
On a slightly deeper look at the comparison results between 1.14 and trunk, it
looks like we're missing quite a few embedded images from .doc files. I
initially thought these could be explained by different handling of macros, but
that's not the issue.
I haven't traced the commit that did it (very likely my fault), but the problem
shows up when we call {{handlePictureCharacterRun}} with a null character run:
{noformat}
// Handle any pictures that we haven't output yet
for (Picture p = pictures.nextUnclaimed(); p != null; ) {
    handlePictureCharacterRun(
            null, p, pictures, xhtml
    );
    p = pictures.nextUnclaimed();
}
{noformat}
The null character run then causes the picture to be skipped in this check,
because {{isRendered(cr)}} returns false if {{cr}} is {{null}}:
{noformat}
if (!isRendered(cr) || picture == null) {
    // Oh dear, we've run out...
    // Probably caused by multiple \u0008 images referencing
    // the same real image
    return;
}
{noformat}
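To make the interaction above concrete, here is a minimal standalone sketch. It is not the Tika source and not a committed fix: {{Run}}, its {{markedDeleted}} flag, the two {{emits...}} helpers, and the use of a String for the picture are hypothetical stand-ins. The only behavior taken from the report is that {{isRendered(cr)}} returns false when {{cr}} is {{null}}. The second guard shows one possible way to let leftover pictures through when no character run is available.

```java
public class NullRunSketch {

    // Hypothetical stand-in for a character run; only one illustrative flag.
    static final class Run {
        final boolean markedDeleted;

        Run(boolean markedDeleted) {
            this.markedDeleted = markedDeleted;
        }
    }

    // As described in the report: a null run is treated as "not rendered".
    // The markedDeleted condition is an assumption made for this sketch.
    static boolean isRendered(Run cr) {
        return cr != null && !cr.markedDeleted;
    }

    // The guard as quoted in the report. Returns true if the picture
    // would be emitted, false if the method would return early.
    static boolean emitsWithReportedGuard(Run cr, String picture) {
        if (!isRendered(cr) || picture == null) {
            return false;
        }
        return true;
    }

    // One possible adjustment (an assumption, not the actual fix):
    // skip only when a run is present and not rendered, so a null run
    // no longer discards an otherwise valid picture.
    static boolean emitsWithNullTolerantGuard(Run cr, String picture) {
        if (picture == null || (cr != null && !isRendered(cr))) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String leftover = "unclaimed-picture"; // hypothetical picture handle
        System.out.println("reported guard emits: "
                + emitsWithReportedGuard(null, leftover));
        System.out.println("null-tolerant guard emits: "
                + emitsWithNullTolerantGuard(null, leftover));
    }
}
```

With a null run, the reported guard drops every unclaimed picture, which matches the missing images seen in the comparison. Whether a null run should count as rendered, or whether the caller should skip the rendered check for leftover pictures, is a design choice this sketch does not settle.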