[ 
https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570208#comment-13570208
 ] 

Michael McCandless commented on TIKA-1072:
------------------------------------------

OK I did some digging on this.  The DirectoryNode of this embedded document has 
these entries:
{noformat}
ent=PICT size=797
ent=ObjInfo size=4
ent=Ole10Native size=40
ent=Ole10FmtProgID size=13
ent=OlePres000 size=40
ent=CompObj size=82
ent=PIC size=100
ent=META size=582
ent=Ole size=20
{noformat}

So I believe it really is an Ole10Native record.  Ole10Native then tries 
to parse it, with plain=false, but runs out of bytes on this line:
{noformat}
      flags2 = LittleEndian.getShort(data, ofs);
{noformat}
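
As a rough illustration of how a read like that could report truncation 
instead of throwing AIOOBE, here is a minimal sketch.  CheckedLittleEndian 
is a hypothetical helper, not part of POI, and the choice of IOException is 
an assumption; inside Ole10Native the natural choice would presumably be 
Ole10NativeException.

```java
import java.io.IOException;

// Hypothetical helper for illustration only; this is not a POI class.
class CheckedLittleEndian {

    /**
     * Reads a signed 16-bit little-endian value at ofs, throwing a checked
     * exception (rather than ArrayIndexOutOfBoundsException) when fewer
     * than two bytes remain.
     */
    static short getShort(byte[] data, int ofs) throws IOException {
        if (ofs < 0 || ofs > data.length - 2) {
            throw new IOException("Truncated record: need 2 bytes at offset "
                    + ofs + " but length is " + data.length);
        }
        int lo = data[ofs] & 0xFF;      // low-order byte comes first
        int hi = data[ofs + 1] & 0xFF;  // high-order byte comes second
        return (short) ((hi << 8) | lo);
    }
}
```

With a check like this, a caller that already handles a checked parse 
exception would skip the entry without needing to know about AIOOBE.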

It seems likely that this entry is corrupt.  Is 40 bytes too small for an 
Ole10Native entry?  If so, could we fix AbstractPOIFSExtractor to log the 
exception, skip this one embedded document, and go on to parse the others?  
That is, isolate the exception rather than abort the entire extraction; in 
this case the main document extracts fine.
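
To make the "isolate the exception" idea concrete, here is a minimal, 
self-contained sketch.  EmbeddedIsolationSketch, EntryParser and extractAll 
are hypothetical names for illustration, not Tika or POI APIs, and the toy 
parser only mimics a reader that runs past the end of a short entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Tika or POI.
class EmbeddedIsolationSketch {

    /** Stand-in for whatever parses a single embedded entry. */
    interface EntryParser {
        String parse(String name, byte[] data) throws Exception;
    }

    /** Parses every entry, logging and skipping the ones that fail. */
    static List<String> extractAll(Map<String, byte[]> entries, EntryParser parser) {
        List<String> extracted = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            try {
                extracted.add(parser.parse(e.getKey(), e.getValue()));
            } catch (Exception ex) {
                // Catches checked parse errors as well as runtime ones such
                // as ArrayIndexOutOfBoundsException, so one bad entry does
                // not abort the remaining entries.
                System.err.println("Skipping embedded entry " + e.getKey() + ": " + ex);
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        entries.put("good-1", new byte[] {1, 0});
        entries.put("truncated", new byte[] {1});
        entries.put("good-2", new byte[] {2, 0});
        // Toy parser: reads data[1], so a 1-byte entry throws AIOOBE.
        EntryParser parser = (name, data) -> name + ":" + (data[0] + data[1]);
        System.out.println(extractAll(entries, parser));
    }
}
```

Catching Exception this broadly is a judgment call: it keeps the main 
document's extraction alive, at the cost of possibly hiding real bugs unless 
the skipped entries are logged.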
                
> AIOOBE when handling embedded document in .doc file
> ---------------------------------------------------
>
>                 Key: TIKA-1072
>                 URL: https://issues.apache.org/jira/browse/TIKA-1072
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 1.4
>
>         Attachments: 20-Force-on-a-current-S00.doc
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
>       at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
>       at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
>       at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
>       at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
>       at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object ... the code
> that does this parsing captures and ignores Ole10NativeException and
> skips the entry ... so I'm wondering if we should also catch AIOOBE
> and skip the entry?  Ie, maybe this entry really is not OLE10, and the
> Ole10Native code is failing to throw Ole10NativeException for it?
