COSStream doesn't actually stream tokens, causing OOM in larger PDF text 
extraction
-----------------------------------------------------------------------------------

                 Key: PDFBOX-695
                 URL: https://issues.apache.org/jira/browse/PDFBOX-695
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.0
         Environment: All
            Reporter: Kyle Maxwell
         Attachments: pdfbox-oom-against-935604.patch

Text extraction of certain PDFs has been hanging and/or running out of memory
(OOM).  Profiling revealed that PDFStreamEngine.processSubStream() eventually
calls PDFStreamParser.getTokens(), which assembles an ArrayList of all tokens
in the stream.  In some cases, this list alone can use over 1 GB of memory.

The attached patch replaces PDFStreamParser.getTokens() with 
PDFStreamParser.getTokensIterator(), which streams the tokens, avoiding the 
ArrayList build.  It only uses this in the call path of 
org.apache.pdfbox.ExtractText, so the fix may not benefit other usages.  Also,
the API used by the fix may not be ideal.
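
To illustrate the idea only (this is not the attached patch and not the
PDFBox API), here is a minimal sketch of pulling tokens lazily through an
Iterator instead of collecting them into a list first.  The class name
LazyTokenSketch, the method tokensIterator(), and the whitespace-based
tokenizing are all hypothetical simplifications; a real content-stream parser
has to handle strings, arrays, dictionaries and inline images.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a content-stream parser.
// Not the PDFBox API and not the code in the attached patch.
class LazyTokenSketch {

    /** Returns an iterator that reads one whitespace-delimited token at a time. */
    public static Iterator<String> tokensIterator(final Reader in) {
        return new Iterator<String>() {
            private String next; // token read ahead by hasNext(), or null

            @Override
            public boolean hasNext() {
                if (next == null) {
                    next = readToken(in);
                }
                return next != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String token = next;
                next = null; // drop the reference so the token can be collected
                return token;
            }
        };
    }

    /** Reads the next token from the reader, or returns null at end of input. */
    private static String readToken(Reader in) {
        try {
            int c = in.read();
            while (c != -1 && Character.isWhitespace(c)) {
                c = in.read(); // skip leading whitespace
            }
            if (c == -1) {
                return null;
            }
            StringBuilder sb = new StringBuilder();
            while (c != -1 && !Character.isWhitespace(c)) {
                sb.append((char) c);
                c = in.read();
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader in = new StringReader("BT /F1 12 Tf (Hello) Tj ET");
        Iterator<String> tokens = tokensIterator(in);
        // Only one token is held at a time; no list of all tokens is built.
        while (tokens.hasNext()) {
            System.out.println(tokens.next());
        }
    }
}
```

With this shape, the caller's peak memory depends on the largest single token
rather than on the total number of tokens in the stream, which is the property
the patch is after.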

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
