[ https://issues.apache.org/jira/browse/PDFBOX-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903364#action_12903364 ]
Wiley Fuller commented on PDFBOX-506:
-------------------------------------
Hi Thomas. The file already attached to this issue (siegel.pdf) can be used to
replicate the problem in 1.2.1. However, I just checked out and built
1.3.0-SNAPSHOT, and it works without any problems.
Thanks.
> PDFBox can't parse PDF documents from jstor.org
> -----------------------------------------------
>
> Key: PDFBOX-506
> URL: https://issues.apache.org/jira/browse/PDFBOX-506
> Project: PDFBox
> Issue Type: Bug
> Reporter: Dave Engberg
> Attachments: siegel.pdf
>
>
> The academic repository JSTOR makes papers available in PDF format. The
> PDFs give this origin information:
> Content creator: JstorPdfGenerator v1.0
> PDF producer: iText 2.0.6 (by lowagie.com)
> These PDFs open fine in Acrobat, Preview, Foxit, etc., but they throw an
> exception in PDFBox:
> Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
>     at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
>     at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
>     at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
>     at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)
> I traced through the code, and it appears that PDFBox rejects these because
> they contain a 'startxref' that is not followed by a %%EOF two lines later:
> ...
> startxref
> 613364
> 1 0 obj
> ...
> Here's a small patch that will accept files that are missing the EOF after
> the startxref:
> Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
> ===================================================================
> --- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (revision 802578)
> +++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (working copy)
> @@ -453,11 +453,9 @@
>          {
>              parseStartXref();
>              //verify that EOF exists
> -            String eof = readExpectedString( "%%EOF" );
> -            if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
> -            {
> -                throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
> -                    " next=" +readString() );
> +            int c = pdfSource.peek();
> +            if (c == '%') {
> +                readExpectedString("%%EOF");
>              }
>              isEndOfFile = true;
>          }
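
For readers without a PDFBox checkout, the idea behind the patch above
(peek at the next character and only demand '%%EOF' when a '%' is actually
there) can be sketched with the JDK alone. This is a rough illustration, not
PDFBox code: the class LenientEofCheck and the method consumeOptionalEof are
hypothetical names, java.io.PushbackReader stands in for PDFBox's pdfSource,
and the whitespace skipping is an assumption about what precedes the marker.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

/** Illustrative sketch only; none of these names exist in PDFBox. */
final class LenientEofCheck {
    private static final String EOF_MARKER = "%%EOF";

    /**
     * Skips whitespace, then consumes "%%EOF" only if the next character is '%'.
     * Returns true if the marker was consumed. Returns false if it was absent,
     * leaving the reader positioned at the next non-whitespace character.
     */
    static boolean consumeOptionalEof(PushbackReader in) throws IOException {
        int c = in.read();
        while (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            c = in.read();
        }
        if (c == -1) {
            return false; // end of input, no marker
        }
        if (c != '%') {
            in.unread(c); // not a marker: hand the character back to the caller
            return false;
        }
        // The first '%' is already consumed; the rest must match exactly.
        for (int i = 1; i < EOF_MARKER.length(); i++) {
            if (in.read() != EOF_MARKER.charAt(i)) {
                throw new IOException("Expected '" + EOF_MARKER + "' after '%'");
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // A well-formed tail: the marker follows the startxref offset.
        PushbackReader ok = new PushbackReader(new StringReader("\n%%EOF\n"));
        System.out.println(consumeOptionalEof(ok));

        // The JSTOR-style tail: an object follows the offset directly.
        PushbackReader jstor = new PushbackReader(new StringReader("\n1 0 obj"));
        System.out.println(consumeOptionalEof(jstor));
    }
}
```

As in the patch, a missing marker is tolerated, while a '%' that does not
begin a complete '%%EOF' is still reported as an error.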