[ 
https://issues.apache.org/jira/browse/PDFBOX-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737887#comment-13737887
 ] 

Tilman Hausherr commented on PDFBOX-1692:
-----------------------------------------

There's a problem with the EOF handling in CMapParser.java. In 
parseNextToken(), at "case -1", the -1 is returned and no exception is thrown. 
In parse(), at line 217, there's a loop from startCode to endCode. For the 
"problem" file, startCode is 0 0 and endCode is 255 255, so the loop runs 
65536 times. That loop is itself executed many times, until the heap is 
exhausted. Even throwing an exception doesn't help (I tried, but parse() is 
called again).
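
To make the arithmetic concrete, here is a minimal sketch of what such a 
range expansion costs. It is a simplified, hypothetical model and not the 
actual CMapParser source: the class CMapRangeSketch and the methods toInt() 
and expandRange() are made up for illustration, and it assumes that each code 
in the range gets its own map entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a CMap range expansion.
// This is NOT the PDFBox CMapParser code; names are illustrative only.
class CMapRangeSketch {

    // Interpret a big-endian byte array as an unsigned integer code.
    static int toInt(byte[] code) {
        int value = 0;
        for (byte b : code) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    // Assumption: every code in [start, end] is stored as a separate entry.
    static Map<Integer, String> expandRange(byte[] start, byte[] end, int firstUnicode) {
        Map<Integer, String> mapping = new HashMap<Integer, String>();
        int from = toInt(start);
        int to = toInt(end);
        for (int code = from; code <= to; code++) {
            int codePoint = firstUnicode + (code - from);
            mapping.put(code, new String(Character.toChars(codePoint)));
        }
        return mapping;
    }

    public static void main(String[] args) {
        byte[] startCode = {0, 0};                      // "0 0"
        byte[] endCode = {(byte) 255, (byte) 255};      // "255 255"
        Map<Integer, String> mapping = expandRange(startCode, endCode, 0);
        System.out.println(mapping.size());             // prints 65536
    }
}
```

One such expansion is harmless on its own. The heap only runs out if, as 
described above, the expansion is repeated over and over because the end of 
the stream is never treated as a stop condition.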
                
> java.lang.OutOfMemoryError: Java heap space
> -------------------------------------------
>
>                 Key: PDFBOX-1692
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1692
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.2
>         Environment: Windows 7
> java version 1.7.0_17 (build 1.7.0_17-b02/64-Bit Server VM build 23.7-01)
> pdfbox-app-1.8.2.jar
>            Reporter: Christian Czech
>         Attachments: test_1fd9a_test.pdf
>
>
> Hello,
> I have a problem with text extraction.
> The VM runs out of memory during text extraction!
> My code:
> String pdfFile = "D:\\testfolder\\test1fd9a_test.pdf"; // size of file: 168 KB
> PDDocument document = PDDocument.load(pdfFile, true);
> PDFTextStripper stripper = null;
> try {
>     stripper = new PDFTextStripper();
>     stripper.setSortByPosition(true);
>     stripper.writeText(document, outputWriter);
> } catch (IOException e) {
>     e.printStackTrace();
> }
> You get an error:
> java.lang.OutOfMemoryError: Java heap space 
