[ https://issues.apache.org/jira/browse/PDFBOX-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179774#comment-14179774 ]
Michael Goddard commented on PDFBOX-2445:
-----------------------------------------
On a project using Apache Tika 1.6, we hit this issue with a particular PDF
file. To check, I tried extracting the text with PDFBox alone, as shown below,
and observed the same issue. Adobe Acrobat Reader is able to extract the text
from this document. Is there any way to work around this other than avoiding
the problematic PDF files?
Here's what I observed:
[Downloads]$ java -Xmx1g -jar pdfbox-app-1.8.7.jar ExtractText -console -encoding UTF-8 ./Apache_Solr_4.7_Ref_Guide.pdf
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.AbstractCollection.toArray(AbstractCollection.java:136)
    at java.util.ArrayList.<init>(ArrayList.java:168)
    at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:534)
    at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:591)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:258)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1233)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1198)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1123)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Oct 22, 2014 5:04:13 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
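To rule out the heap setting simply not being applied, the limit the JVM actually sees can be printed with a few lines of plain Java. This is only a minimal sketch: the class name HeapCheck and the helper toMiB are hypothetical and are not part of PDFBox or Tika.

```java
public class HeapCheck {
    // Converts a byte count to whole mebibytes (rounded down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most memory the JVM will attempt to use (driven by -Xmx).
        System.out.println("max heap (MiB):   " + toMiB(rt.maxMemory()));
        // totalMemory(): memory currently reserved by the JVM.
        System.out.println("total heap (MiB): " + toMiB(rt.totalMemory()));
        // freeMemory(): the unused part of the currently reserved memory.
        System.out.println("free heap (MiB):  " + toMiB(rt.freeMemory()));
    }
}
```

Running it with the same flag as above (java -Xmx1g HeapCheck) should report a maximum in the neighbourhood of 1024 MiB; the exact figure can be somewhat lower depending on the JVM and garbage collector.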
I couldn't find the "upload" button here, so I uploaded the PDF to S3, along
with the text produced by Adobe Acrobat Reader:
https://s3.amazonaws.com/goddard.public/Apache_Solr_4.7_Ref_Guide.pdf
https://s3.amazonaws.com/goddard.public/Apache_Solr_4.7_Ref_Guide.txt
I also tried enabling the non-sequential PDFBox parser from my code, which
uses Tika, but this didn't solve the problem:
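// PDFParserConfig is org.apache.tika.parser.pdf.PDFParserConfig;
// context is presumably the Tika ParseContext passed to the parser.
PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setUseNonSequentialParser(true);
context.set(PDFParserConfig.class, pdfParserConfig);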
> Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf
> --------------------------------------------------------------
>
> Key: PDFBOX-2445
> URL: https://issues.apache.org/jira/browse/PDFBOX-2445
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.8.7, 2.0.0
> Reporter: Maruan Sahyoun
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)