Tim Allison created TIKA-2354:
---------------------------------

             Summary: Missing many embedded images in .doc files
                 Key: TIKA-2354
                 URL: https://issues.apache.org/jira/browse/TIKA-2354
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison
            Priority: Blocker


On a slightly deeper look at the comparison results between 1.14 and trunk, it 
looks like we're missing quite a few embedded images from .doc files.  I 
initially thought these could be explained by different handling of macros, but 
that's not the issue.

I haven't traced the commit that did it (very likely my fault), but the 
problem shows up when we call this with a null character run:
{noformat}
        // Handle any pictures that we haven't output yet
        for (Picture p = pictures.nextUnclaimed(); p != null; ) {
            handlePictureCharacterRun(
                    null, p, pictures, xhtml
            );
            p = pictures.nextUnclaimed();
        }
{noformat}

The null character run then triggers skipping of the picture in this check, 
because {{isRendered(cr)}} returns false if {{cr}} is {{null}}:

{noformat}
        if (!isRendered(cr) || picture == null) {
            // Oh dear, we've run out...
            // Probably caused by multiple \u0008 images referencing
            //  the same real image
            return;
        }
{noformat}
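
To make the interaction concrete, here is a minimal, self-contained sketch of the two snippets above. It does not use the real Tika or POI classes: {{Run}}, {{emitsWithReportedGuard}}, {{emitsWithNullTolerantGuard}} and {{countEmitted}} are hypothetical stand-ins, and {{isRendered}} is written only to match the behavior described here (false for a null run). The null-tolerant guard is one possible direction, not necessarily the right fix.

```java
import java.util.Arrays;
import java.util.List;

class UnclaimedPictureSketch {

    /** Hypothetical stand-in for a character run; only tracks deletion state. */
    static final class Run {
        final boolean markedDeleted;

        Run(boolean markedDeleted) {
            this.markedDeleted = markedDeleted;
        }
    }

    /** Assumed behavior, as described in this report: a null run is "not rendered". */
    static boolean isRendered(Run cr) {
        return cr != null && !cr.markedDeleted;
    }

    /** Same condition as the quoted check: returns false when the picture is skipped. */
    static boolean emitsWithReportedGuard(Run cr, String picture) {
        if (!isRendered(cr) || picture == null) {
            return false;
        }
        return true;
    }

    /** Possible alternative: consult isRendered only when a run is actually present. */
    static boolean emitsWithNullTolerantGuard(Run cr, String picture) {
        if (picture == null) {
            return false;
        }
        if (cr != null && !isRendered(cr)) {
            return false;
        }
        return true;
    }

    /** Simulates the "unclaimed pictures" loop, which always passes a null run. */
    static int countEmitted(List<String> unclaimed, boolean nullTolerant) {
        int emitted = 0;
        for (String picture : unclaimed) {
            boolean ok = nullTolerant
                    ? emitsWithNullTolerantGuard(null, picture)
                    : emitsWithReportedGuard(null, picture);
            if (ok) {
                emitted++;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> unclaimed = Arrays.asList("image1.png", "image2.png");
        System.out.println("reported guard emitted: " + countEmitted(unclaimed, false));
        System.out.println("null-tolerant guard emitted: " + countEmitted(unclaimed, true));
    }
}
```

With the guard as quoted, every picture reached through the unclaimed loop is dropped, because the loop always passes a null run. The alternative still skips pictures attached to a deleted run but lets the unclaimed ones through.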



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
