[jira] [Commented] (PDFBOX-2445) Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf

Michael Goddard (JIRA) Wed, 22 Oct 2014 02:58:04 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179774#comment-14179774 ]


Michael Goddard commented on PDFBOX-2445:
-----------------------------------------

On a project using Apache Tika 1.6, we hit this issue with a particular PDF 
file. To check, I tried extracting the text with PDFBox alone, as shown below, 
and observed the same error. Adobe Acrobat Reader is able to extract the text 
from this document. Are there any thoughts on how to solve this, other than 
avoiding certain problematic PDF files?

Here's what I observed:

[Downloads]$ java -Xmx1g -jar pdfbox-app-1.8.7.jar ExtractText -console -encoding UTF-8 ./Apache_Solr_4.7_Ref_Guide.pdf
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.AbstractCollection.toArray(AbstractCollection.java:136)
    at java.util.ArrayList.<init>(ArrayList.java:168)
    at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:534)
    at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:591)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:258)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1233)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1198)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1123)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Oct 22, 2014 5:04:13 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document

I couldn't find the "upload" button here, so I uploaded the PDF to S3, along 
with the text produced by Adobe Acrobat Reader:

https://s3.amazonaws.com/goddard.public/Apache_Solr_4.7_Ref_Guide.pdf
https://s3.amazonaws.com/goddard.public/Apache_Solr_4.7_Ref_Guide.txt

Also, I tried enabling the non-sequential PDFBox parser from my code, which 
uses Tika, but this didn't solve the problem:

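// imports: org.apache.tika.parser.ParseContext, org.apache.tika.parser.pdf.PDFParserConfig
// "context" is the ParseContext handed to the Tika parser
PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setUseNonSequentialParser(true);
context.set(PDFParserConfig.class, pdfParserConfig);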
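
Since neither parser setting avoided the error here, a caller may at least want to keep one problematic file from ending a whole run. The following is only a sketch under stated assumptions, using nothing but the JDK: `GuardedExtraction` and `extractOrSkip` are hypothetical names, not part of PDFBox or Tika, and the task passed in stands in for whatever extraction call is actually used. It does not reduce the parser's memory use, and catching OutOfMemoryError is only reasonable when the failed task's allocations become unreachable afterwards.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

// Hypothetical helper; not part of PDFBox or Tika.
final class GuardedExtraction {

    private GuardedExtraction() {
    }

    /**
     * Runs one extraction task. Returns the extracted text, or an empty
     * Optional if the task threw an OutOfMemoryError or any Exception.
     */
    public static Optional<String> extractOrSkip(Callable<String> task) {
        try {
            return Optional.ofNullable(task.call());
        } catch (OutOfMemoryError e) {
            // The failed task's result is abandoned here; the caller
            // should log the file name and move on to the next one.
            return Optional.empty();
        } catch (Exception e) {
            // Parse and I/O failures are treated as "no text" as well.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for real extraction calls (assumption: a real task
        // would call PDFBox or Tika and return the document text).
        Optional<String> ok = extractOrSkip(() -> "some text");
        Optional<String> failed = extractOrSkip(() -> {
            throw new OutOfMemoryError("Java heap space");
        });
        System.out.println("first task produced text: " + ok.isPresent());
        System.out.println("second task produced text: " + failed.isPresent());
    }
}
```

This is a workaround on the calling side, not an answer to the question above; the file that fails still yields no text.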


> Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf
> --------------------------------------------------------------
>
>                 Key: PDFBOX-2445
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2445
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Maruan Sahyoun
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
