COSStream doesn't actually stream tokens, causing OOM in larger PDF text
extraction
-----------------------------------------------------------------------------------
Key: PDFBOX-695
URL: https://issues.apache.org/jira/browse/PDFBOX-695
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.2.0
Environment: All
Reporter: Kyle Maxwell
Attachments: pdfbox-oom-against-935604.patch
Text extraction of certain PDFs has been hanging and/or running out of memory
(OOM). Profiling revealed that PDFStreamEngine.processSubStream() eventually
calls PDFStreamParser.getTokens(), which assembles an ArrayList of all tokens
in the stream. In some cases, this can use over 1 GB of memory.
The attached patch replaces PDFStreamParser.getTokens() with
PDFStreamParser.getTokensIterator(), which streams the tokens instead of
building the ArrayList. The patch only uses the iterator in the call path of
org.apache.pdfbox.ExtractText, so the fix may not benefit other callers. Also,
the API used by the fix may not be ideal.
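To illustrate the difference between the two approaches, here is a minimal,
self-contained sketch. It is not the attached patch and does not use any PDFBox
classes: the class TokenStreamingSketch, the whitespace tokenizer, and
countOccurrences() are hypothetical stand-ins, and only the method names
getTokens and getTokensIterator are borrowed from the description above. Real
content-stream tokenizing is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; none of this is PDFBox code.
public class TokenStreamingSketch {

    // Eager variant: every token is held in memory before the caller sees any.
    static List<String> getTokens(String content) {
        List<String> tokens = new ArrayList<>();
        Iterator<String> it = getTokensIterator(content);
        while (it.hasNext()) {
            tokens.add(it.next());
        }
        return tokens;
    }

    // Lazy variant: each token is produced only when the caller asks for it,
    // so at most one token object is live at a time.
    static Iterator<String> getTokensIterator(final String content) {
        return new Iterator<String>() {
            private int pos = 0;

            private void skipWhitespace() {
                while (pos < content.length()
                        && Character.isWhitespace(content.charAt(pos))) {
                    pos++;
                }
            }

            @Override
            public boolean hasNext() {
                skipWhitespace();
                return pos < content.length();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int start = pos;
                while (pos < content.length()
                        && !Character.isWhitespace(content.charAt(pos))) {
                    pos++;
                }
                return content.substring(start, pos);
            }
        };
    }

    // A consumer that handles one token at a time and keeps no list.
    static int countOccurrences(Iterator<String> tokens, String wanted) {
        int count = 0;
        while (tokens.hasNext()) {
            if (wanted.equals(tokens.next())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String content = "BT (Hello) Tj (World) Tj ET";
        System.out.println(getTokens(content).size());
        System.out.println(countOccurrences(getTokensIterator(content), "Tj"));
    }
}
```

With the eager variant, peak memory grows with the number of tokens in the
stream. With the iterator, it depends only on what the consumer chooses to
retain, which is the property the patch relies on.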