[jira] Commented: (PDFBOX-506) PDFBox can't parse PDF documents from jstor.org

Wiley Fuller (JIRA) Fri, 27 Aug 2010 04:08:24 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903364#action_12903364
 ]


Wiley Fuller commented on PDFBOX-506:
-------------------------------------

Hi Thomas.  The file already attached to this issue (siegel.pdf) can be used to 
replicate the problem in 1.2.1.   However, I just checked out and built 
1.3.0-SNAPSHOT, and it works without any problems. 

Thanks.

> PDFBox can't parse PDF documents from jstor.org
> -----------------------------------------------
>
>                 Key: PDFBOX-506
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-506
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Dave Engberg
>         Attachments: siegel.pdf
>
>
> The academic repository JSTOR makes papers available in PDF format.  The 
> PDFs report this origin information:
>   Content creator:  JstorPdfGenerator v1.0
>   PDF producer:  iText 2.0.6 (by lowagie.com)
> These PDFs open fine in Acrobat, Preview, FoxIt, etc., but they throw an 
> exception in PDFBox:
> Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
>       at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
>       at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
>       at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
>       at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
>       at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)
> I traced through the code, and it appears that PDFBox rejects these because 
> they contain a 'startxref' that is not followed by a %%EOF two lines later:
> ...
> startxref
> 613364
> 1 0 obj
> ...
> Here's a small patch that will accept files that are missing the EOF after 
> the startxref:
> Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
> ===================================================================
> --- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java  (revision 802578)
> +++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java  (working copy)
> @@ -453,11 +453,9 @@
>              {  
>                  parseStartXref();
>                  //verify that EOF exists 
> -                String eof = readExpectedString( "%%EOF" );
> -                if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
> -                {
> -                    throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
> -                            " next=" +readString() );
> +                int c = pdfSource.peek();
> +                if (c == '%') {
> +                    readExpectedString("%%EOF");
>                  }
>                  isEndOfFile = true; 
>              }
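
The idea behind the patch above can be tried in isolation with the sketch
below. It is not PDFBox code: the class and method names (LenientEofCheck,
consumeOptionalEof, eofMarkerPresent) are made up for illustration, and a
plain java.io.PushbackInputStream stands in for the parser's pdfSource. It
assumes the stream is positioned just after the startxref offset, peeks at
the next non-whitespace byte, and only insists on "%%EOF" when that byte
is '%'.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone illustration; none of these names exist in PDFBox.
class LenientEofCheck {

    // The six whitespace bytes used by PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    private static boolean isPdfWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Consumes an optional "%%EOF" marker from a stream positioned after the
     * startxref offset. Returns true if the marker was read, false if the next
     * non-whitespace byte is not '%' (that byte is pushed back unread).
     * Throws IOException if a '%' starts something other than "%%EOF".
     */
    public static boolean consumeOptionalEof(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c != -1 && isPdfWhitespace(c)) {
            c = in.read();
        }
        if (c == -1) {
            return false;            // end of data, no marker
        }
        in.unread(c);                // "peek": put the byte back
        if (c != '%') {
            return false;            // e.g. "1 0 obj" follows, tolerate it
        }
        for (byte expected : "%%EOF".getBytes(StandardCharsets.US_ASCII)) {
            int actual = in.read();
            if (actual != expected) {
                throw new IOException("Expected '%%EOF' but read byte " + actual);
            }
        }
        return true;
    }

    // Convenience wrapper for in-memory input; 'tail' is whatever follows the offset.
    public static boolean eofMarkerPresent(String tail) {
        byte[] bytes = tail.getBytes(StandardCharsets.ISO_8859_1);
        try (PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            return consumeOptionalEof(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eofMarkerPresent("\n%%EOF\n"));   // well-formed tail
        System.out.println(eofMarkerPresent("\n1 0 obj\n")); // JSTOR-style tail
    }
}
```

With the two inputs in main, the first call reports that the marker was
found and the second reports that it was absent, without raising an error
in either case. A tail that starts with '%' but is not "%%EOF" is still
rejected, which matches the behaviour of the patched branch.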

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
