Eric Leleu created PDFBOX-1639:
----------------------------------
Summary: Infinite loop with PDFParser used by tika.
Key: PDFBOX-1639
URL: https://issues.apache.org/jira/browse/PDFBOX-1639
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 1.8.2, 1.7.1, 2.0.0
Reporter: Eric Leleu
Assignee: Eric Leleu
Hi,
I encountered an issue in a production environment that caused a disk-full
error. :(
Tika uses the PDFParser with the "forceParsing" boolean set to true in order to
continue the parsing even if an error occurs.
Two PDFs have an object number greater than the maximum int value, so the
readInt() method fails.
Because of the "forceParsing" boolean, the parser tries to move on to the next
object, but it cannot: on error, the readInt method backtracks over the bytes
it has read, so the "skipToNextObj" method does nothing and we try to parse the
same object indefinitely...
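
To illustrate the mechanism, here is a minimal, simplified model of the loop.
It is not the PDFBox code: the class BacktrackLoopSketch, its String cursor,
and countStuckAttempts are hypothetical, and only the names readInt and
skipToNextObj are borrowed from the description above. The attempt cap exists
only so that the sketch terminates.

```java
// Simplified, hypothetical model of the failure mode; NOT the PDFBox code.
class BacktrackLoopSketch {
    private final String data;
    private int pos = 0;

    BacktrackLoopSketch(String data) {
        this.data = data;
    }

    // Reads a run of digits as an int; on failure, rewinds to where it started.
    int readInt() {
        int start = pos;
        while (pos < data.length() && Character.isDigit(data.charAt(pos))) {
            pos++;
        }
        try {
            return Integer.parseInt(data.substring(start, pos));
        } catch (NumberFormatException e) {
            pos = start; // backtrack: the consumed digits are "unread"
            throw e;
        }
    }

    // Assumed recovery step: advance until the cursor sits on a digit.
    // After a backtrack the cursor is already on a digit, so nothing moves.
    void skipToNextObj() {
        while (pos < data.length() && !Character.isDigit(data.charAt(pos))) {
            pos++;
        }
    }

    // Mimics the forceParsing retry loop, capped so the sketch terminates.
    // Returns how many attempts were made before success or hitting the cap.
    static int countStuckAttempts(String data, int maxAttempts) {
        BacktrackLoopSketch parser = new BacktrackLoopSketch(data);
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                parser.readInt();
                return attempts; // object number parsed, no loop
            } catch (NumberFormatException e) {
                parser.skipToNextObj(); // recovery makes no progress
            }
        }
        return attempts; // cap reached: the real parser would spin forever
    }
}
```

With an object number such as 4294967296, every attempt fails at the same
offset and the cap is always reached; without a cap the loop never ends.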
The COSObjectKey object already uses a long as its object number, so we should
read a long instead of an integer during parsing, using a "readLong" method to
handle object numbers that are too large for an int.
Do you agree with that?
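
As a rough sketch of the proposal (not a patch), a "readLong" could look like
the following. The class ReadLongSketch, its String cursor, and
parseObjectNumber are hypothetical and only stand in for the real parser
stream; the one change from an int-based reader is the use of Long.parseLong.

```java
// Hypothetical sketch of the proposed readLong; NOT the actual PDFBox patch.
class ReadLongSketch {
    private final String data;
    private int pos = 0;

    ReadLongSketch(String data) {
        this.data = data;
    }

    // Reads a run of digits as a long; still rewinds on failure, as before.
    long readLong() {
        int start = pos;
        while (pos < data.length() && Character.isDigit(data.charAt(pos))) {
            pos++;
        }
        try {
            return Long.parseLong(data.substring(start, pos));
        } catch (NumberFormatException e) {
            pos = start; // same backtracking behaviour as the int version
            throw e;
        }
    }

    // Convenience helper: parse the object number at the start of the input.
    static long parseObjectNumber(String data) {
        return new ReadLongSketch(data).readLong();
    }
}
```

This accepts object numbers above Integer.MAX_VALUE (2147483647). A number
that does not fit in a long would still fail and be backtracked, so in this
sketch the change widens the accepted range rather than removing the retry
problem altogether.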
BR,
Eric