[jira] [Commented] (PDFBOX-1757) Errors parsing/extracting text from a PDF

Timo Boehme (JIRA) Mon, 04 Nov 2013 02:53:34 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812744#comment-13812744 ]


Timo Boehme commented on PDFBOX-1757:
-------------------------------------

OK, I've analyzed
http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf

The document is simply broken. The startxref value at the end of the file
points to offset 173, but the xref table actually starts at offset 172, so the
parser sees 'ref' (the 'x' is missing). You can see this by using
PDDocument.loadNonSeq, which throws
java.io.IOException: Error: Expected a long type, actual='ref'
because it does not find 'xref' at that offset and instead tries to read the
data as a cross-reference stream object, which starts with an object number.

The other exception, which you get from the standard sequential parser via
PDDocument.load (Value is not an integer: 636121514401477526485946144), is a
result of the limited parsing capability of that parser (you really should use
'loadNonSeq').

So this is all about relaxing the parser by 'repairing' broken documents,
which makes it a feature request.
Other programs may be able to handle the file because they either perform such
a 'repair' or use the 'Linearized' information to parse the document, so they
do not need the startxref value.
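
To make the off-by-one described above concrete, here is a minimal sketch of how one might check whether the startxref value really lands on an 'xref' keyword, and look for the keyword a few bytes away if it does not. It uses only the Java standard library. The class StartXrefCheck, its method names and the 16-byte search window are hypothetical choices for illustration; none of this is PDFBox API or PDFBox's actual parsing logic.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; these names are not PDFBox API.
public class StartXrefCheck {

    /** Returns the offset written after the last "startxref" keyword, or -1 if it cannot be read. */
    static long readStartXref(String pdf) {
        int pos = pdf.lastIndexOf("startxref");
        if (pos < 0) {
            return -1;
        }
        int i = pos + "startxref".length();
        while (i < pdf.length() && Character.isWhitespace(pdf.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < pdf.length() && pdf.charAt(i) >= '0' && pdf.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // no digits after the keyword
        }
        try {
            return Long.parseLong(pdf.substring(start, i));
        } catch (NumberFormatException e) {
            return -1; // digit run too long to be a usable offset
        }
    }

    /** True if a standalone "xref" keyword (not the tail of "startxref") begins at the offset. */
    static boolean isXrefKeywordAt(String pdf, long offset) {
        if (offset < 0 || offset > pdf.length() - 4) {
            return false;
        }
        int i = (int) offset;
        boolean tailOfStartXref = i >= 5 && pdf.startsWith("start", i - 5);
        return pdf.startsWith("xref", i) && !tailOfStartXref;
    }

    /** Searches outward from the declared offset; returns the nearest "xref" offset or -1. */
    static long findNearbyXref(String pdf, long declared, int window) {
        if (declared < 0) {
            return -1;
        }
        for (int d = 0; d <= window; d++) {
            if (isXrefKeywordAt(pdf, declared - d)) {
                return declared - d;
            }
            if (isXrefKeywordAt(pdf, declared + d)) {
                return declared + d;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StartXrefCheck <file.pdf>");
            return;
        }
        // ISO-8859-1 maps every byte to exactly one char, so char index == byte offset.
        String pdf = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.ISO_8859_1);
        long declared = readStartXref(pdf);
        System.out.println("declared startxref offset: " + declared);
        System.out.println("'xref' keyword found there: " + isXrefKeywordAt(pdf, declared));
        System.out.println("nearest 'xref' within 16 bytes: " + findNearbyXref(pdf, declared, 16));
    }
}
```

Run it as `java StartXrefCheck 020747.pdf`. Note that this sketch only covers classic xref tables: a file whose startxref points to a cross-reference stream has no 'xref' keyword at that offset, so a negative result alone does not prove the file is broken.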

> Errors parsing/extracting text from a PDF
> -----------------------------------------
>
>                 Key: PDFBOX-1757
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1757
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.2
>         Environment: Ubuntu Linux & Windows 7 (both JDK6)
>            Reporter: William Palmer
>            Priority: Minor
>
> I am trying to extract text from PDFs.  Extracting text from the test file 
> http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf causes 
> exceptions to be thrown.
> The first:
> Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Value is not an integer: 636121514401477526485946144
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:187)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
> Caused by: java.io.IOException: Value is not an integer: 636121514401477526485946144
>       at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:104)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:351)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>       at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
> Code that causes the above exception:
> PDFTextStripper ts = new PDFTextStripper();
> PrintWriter out = new PrintWriter(new FileWriter(new File("020747.txt")));
> PDDocument doc = PDDocument.load(new File("020747.pdf").toURI().toURL(), true);
> ts.setForceParsing(true);
> ts.writeText(doc, out);
> Using the following code causes a different exception until 
> org.apache.pdfbox.baseParser.pushBackSize is increased (I only tested 1024768). 
> After it is increased, I get basically the same exception as above:
> PrintWriter out = new PrintWriter(new FileWriter(new File("020747.txt")));
> PDFParser parser = new PDFParser(new FileInputStream(new File("020747.pdf")));
> parser.parse();
> PDFTextStripper ts = new PDFTextStripper();
> ts.setForceParsing(true);
> ts.writeText(parser.getPDDocument(), out);



--
This message was sent by Atlassian JIRA
(v6.1#6144)
