DO NOT REPLY [Bug 51944] New: [PATCH] PAPFormattedDiskPage.getPAPX - IndexOutOfBounds

bugzilla Mon, 03 Oct 2011 16:27:54 -0700

https://issues.apache.org/bugzilla/show_bug.cgi?id=51944


             Bug #: 51944
           Summary: [PATCH] PAPFormattedDiskPage.getPAPX -
                    IndexOutOfBounds
           Product: POI
           Version: 3.8-dev
          Platform: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HWPF
        AssignedTo: [email protected]
        ReportedBy: [email protected]
    Classification: Unclassified


Created attachment 27681
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=27681
Patch of two files

A handful of Word 97-2003 documents (possibly from even earlier versions)
produce an IndexOutOfBoundsException originating in OldPAPBinTable.<init>.
These documents may also include additional encodings from the Hong Kong
region. (I am unable to attach a sample because the documents are sensitive.)

Essentially, the parser thinks there are more elements in the array than are
actually present.  The included patch prevents the error by adding a public
method to PAPFormattedDiskPage that returns the actual size of the _papxList
list.  The value of pfkp.size() originally used within OldPAPBinTable.<init>
does not always seem to be accurate.
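
The idea behind the fix can be sketched as follows. This is only an
illustration, not the attached patch: PageSketch, parsedSize() and readAll()
are hypothetical names standing in for the real POI classes, and the sketch
assumes the declared count can be larger than the number of entries parsed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a formatted disk page; NOT the real POI class.
final class PageSketch {
    private final int declaredCount;     // count taken from the page header
    private final List<String> entries;  // entries that were actually parsed

    PageSketch(int declaredCount, List<String> entries) {
        this.declaredCount = declaredCount;
        this.entries = new ArrayList<>(entries);
    }

    // Declared size; assumed here to sometimes overstate the real count.
    int size() { return declaredCount; }

    // Actual number of parsed entries (the accessor the report proposes adding).
    int parsedSize() { return entries.size(); }

    String get(int index) { return entries.get(index); }

    // Reads every available entry without running past the end of the list.
    static List<String> readAll(PageSketch page) {
        int limit = Math.min(page.size(), page.parsedSize());
        List<String> out = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            out.add(page.get(i));
        }
        return out;
    }
}
```

Bounding the loop by the smaller of the two counts avoids the exception, but
it does not explain why the counts disagree in the first place.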

Stack Trace (Daily Build)
----------------------------------------------------------------
Caused by: java.lang.IndexOutOfBoundsException: Index: 36, Size: 36
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getPAPX(PAPFormattedDiskPage.java:145)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:58)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:108)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 60 more


The supplied patch allows these documents to be extracted; however, there do
appear to be additional underlying issues at play.  (I've only tested 2 of the
12 failed documents in more depth.)  One document ended up being truncated
mid-file and had random newlines and carriage returns inserted.  The other
appeared to have some paragraphs repeated and extra newlines added, although
all of the document's text was extracted.

Because of this, I'm not sure whether you'd want to go forward with the patch,
though I'd think getting as much text out as possible would be preferable to
getting no text at all.

Thanks in advance,

Jeremy

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

