https://issues.apache.org/bugzilla/show_bug.cgi?id=51944
Bug #: 51944
Summary: [PATCH] PAPFormattedDiskPage.getPAPX - IndexOutOfBounds
Product: POI
Version: 3.8-dev
Platform: PC
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
AssignedTo: [email protected]
ReportedBy: [email protected]
Classification: Unclassified
Created attachment 27681
--> https://issues.apache.org/bugzilla/attachment.cgi?id=27681
Patch of two files
A handful of Word 97-2003 documents (possibly even earlier) produce an
IndexOutOfBoundsException originating in OldPAPBinTable.<init>. These
documents may also include additional encodings from the Hong Kong
region. (I am unable to attach a sample because the documents are sensitive.)
Essentially, the parser believes the array holds more elements than are
actually present. The attached patch prevents the error by adding a public
method to PAPFormattedDiskPage that returns the actual size of the
_papxList property. The pfkp.size() value currently used in
OldPAPBinTable.<init> does not always seem to be accurate.
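To illustrate the idea without reproducing the attachment, here is a minimal, self-contained sketch of the guard described above. The class PapxPageSketch and its methods actualSize() and collect() are hypothetical stand-ins, not POI classes or the patch itself; the only assumption taken from the report is that the declared count can exceed the number of entries actually held in the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a formatted disk page; NOT the real POI class. */
class PapxPageSketch {
    private final int declaredCount;      // count claimed by the file data
    private final List<String> papxList;  // entries that were actually parsed

    PapxPageSketch(int declaredCount, List<String> parsed) {
        this.declaredCount = declaredCount;
        this.papxList = new ArrayList<>(parsed);
    }

    /** Count claimed by the file; may be larger than what was parsed. */
    int size() {
        return declaredCount;
    }

    /** Number of entries really present in the list (the added accessor). */
    int actualSize() {
        return papxList.size();
    }

    String getPAPX(int index) {
        return papxList.get(index);
    }

    /** Copies entries, never reading past the end of the backing list. */
    static List<String> collect(PapxPageSketch page) {
        int limit = Math.min(page.size(), page.actualSize());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            out.add(page.getPAPX(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // The page claims 4 entries but only 3 were parsed.
        PapxPageSketch page = new PapxPageSketch(4, Arrays.asList("a", "b", "c"));
        System.out.println(collect(page).size());
    }
}
```

Looping up to size() alone in this sketch would fail on index 3 with the same kind of IndexOutOfBoundsException shown in the stack trace. Clamping to the smaller of the two counts avoids the exception, but it does not explain why the counts disagree.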
Stack Trace (Daily Build)
----------------------------------------------------------------
Caused by: java.lang.IndexOutOfBoundsException: Index: 36, Size: 36
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getPAPX(PAPFormattedDiskPage.java:145)
at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:58)
at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:108)
at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:410)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:69)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 60 more
The supplied patch allows these documents to be extracted; however, there
appear to be additional underlying issues at play. (I've only tested in
depth on 2 of the 12 failed documents.) One document ended up truncated
mid-file and had stray newlines and carriage returns inserted. The other
appeared to have some paragraphs repeated and extra newlines added,
although all of the document's text was extracted.
Because of this, I'm not sure whether you'd want to go forward with the
patch, though I'd think getting as much text out as possible would be
preferable to getting no text at all.
Thanks in advance,
Jeremy