IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no
pictures
----------------------------------------------------------------------------------
Key: TIKA-577
URL: https://issues.apache.org/jira/browse/TIKA-577
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.8
Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
Reporter: Dennis Adler
When parsing a Word 03 document (which, unfortunately, I cannot upload because it
has client-confidential data), an index out of bounds exception occurs in the
POI code used by the WordExtractor. To try to make up for the unavailable doc
file, I've included the results of a couple of hours stepping through the code
to find the failure point. The error occurs because point[0] = point[1] = 30;
upperbound of _paragraphs = 301. This is in the method
org.apache.poi.hwpf.usermodel.Range.getCharacterRun().
The method + line numbers are:
public CharacterRun getCharacterRun(int index)
line 792: int[] point = findRange(_paragraphs, _parStart,
Math.max(chpx.getStart(), _start), chpx.getEnd());
line 794: PAPX papx = _paragraphs.get(point[0]); // <<< This is the
source of the exception
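The failure mode at line 794 can be illustrated in isolation. The sketch below is not POI code: ParagraphIndexDemo, makeParagraphs and lookup are hypothetical names, a plain List<String> stands in for _paragraphs, and the list size of 301 is an assumption based on the upper bound quoted above. It only shows that List.get() throws as soon as the computed index reaches the size of the list.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphIndexDemo {
    // Hypothetical stand-in for the paragraph list; the real list holds PAPX objects.
    static List<String> makeParagraphs(int count) {
        List<String> paragraphs = new ArrayList<String>();
        for (int i = 0; i < count; i++) {
            paragraphs.add("papx-" + i);
        }
        return paragraphs;
    }

    // Mirrors the shape of the failing statement: an unchecked get(index).
    static String lookup(List<String> paragraphs, int index) {
        return paragraphs.get(index);
    }

    public static void main(String[] args) {
        List<String> paragraphs = makeParagraphs(301); // assumed size
        System.out.println(lookup(paragraphs, 300));   // last valid index
        try {
            lookup(paragraphs, 301);                   // one past the end
        } catch (IndexOutOfBoundsException e) {
            System.out.println("out of bounds for size " + paragraphs.size());
        }
    }
}
```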
STACK at time of exception:
Range.getCharacterRun(int) line 794
PicturesTable.getAllPictures() line 191
WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor#1) line 419
WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
OfficeParser.parse(CompositeParser).parse(InputStream, ContentHandler,
Metadata, ParseContext) line 187
DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata,
ParseContext) line 197
AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, Metadata,
ParseContext) line 197
AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext)
line 137
... (my project) ...
As noted, this occurs in a Word 2003 doc which has no pictures (it is a table).
147 character runs (0 - 146) are found on the first pass, and the problem occurs
on that first pass (not sure if there will be others), on the last run. The
failing call is in this code section from
org.apache.poi.hwpf.model.PicturesTable.getAllPictures(), lines 186-191:
public List<Picture> getAllPictures() {
ArrayList<Picture> pictures = new ArrayList<Picture>();
Range range = _document.getOverallRange();
for (int i = 0; i < range.numCharacterRuns(); i++) {
CharacterRun run = range.getCharacterRun(i);
The error occurs on getCharacterRun(146), which is the last run in the range. If
I change point[0] to 300, the call returns normally to
WordExtractor$PicturesSource.<init>(HWPFDocument) line 429, setting <all> to an
empty list. It fails again later, with the same error, on a subsequent call to
getAllPictures.
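Until the lookup itself is fixed, a caller could tolerate the bad run instead of aborting the whole scan. The sketch below is only an illustration of that idea and is not Tika or POI code: TolerantRunScan, RunSource, collectRuns and failingOnLast are all hypothetical names, and strings stand in for CharacterRun objects. It simulates a range of 147 runs whose last run cannot be resolved.

```java
import java.util.ArrayList;
import java.util.List;

final class TolerantRunScan {
    /** Hypothetical stand-in for a range of character runs; not a POI type. */
    interface RunSource {
        int numRuns();
        String getRun(int index); // may throw IndexOutOfBoundsException
    }

    /** Collects every run that can be resolved and skips the ones that cannot. */
    static List<String> collectRuns(RunSource source) {
        List<String> runs = new ArrayList<String>();
        for (int i = 0; i < source.numRuns(); i++) {
            try {
                runs.add(source.getRun(i));
            } catch (IndexOutOfBoundsException e) {
                // Skip the unresolvable run and keep scanning the rest.
            }
        }
        return runs;
    }

    /** Simulates the reported behaviour: only the last run fails. */
    static RunSource failingOnLast(final int count) {
        return new RunSource() {
            public int numRuns() {
                return count;
            }
            public String getRun(int index) {
                if (index == count - 1) {
                    throw new IndexOutOfBoundsException("run " + index);
                }
                return "run-" + index;
            }
        };
    }

    public static void main(String[] args) {
        List<String> runs = collectRuns(failingOnLast(147));
        System.out.println(runs.size() + " of 147 runs resolved");
    }
}
```

Swallowing the exception hides the underlying defect, so this is at most a stopgap for callers that only need the picture list.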
POTENTIAL FIX: if point[0] >= _paragraphs.size(), return an EMPTY CharacterRun
for the paragraph in question.
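The guard proposed above could look roughly like the following. This is a minimal sketch, not a patch against POI: GuardedLookup and getOrDefault are hypothetical names, the list is generic rather than a list of PAPX objects, and the fallback value stands in for whatever "empty" result the real method would return.

```java
import java.util.ArrayList;
import java.util.List;

final class GuardedLookup {
    /** Returns items.get(index), or fallback when index lies outside the list. */
    static <T> T getOrDefault(List<T> items, int index, T fallback) {
        if (index < 0 || index >= items.size()) {
            return fallback; // out of range: caller gets the "empty" value
        }
        return items.get(index);
    }

    public static void main(String[] args) {
        List<String> paragraphs = new ArrayList<String>();
        paragraphs.add("first");
        paragraphs.add("second");
        System.out.println(getOrDefault(paragraphs, 1, "EMPTY"));
        System.out.println(getOrDefault(paragraphs, 2, "EMPTY"));
    }
}
```

Note that the comparison is >= rather than >, since an index equal to the list size is already out of range.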
Cannot send repro document - contains confidential client data.