[jira] Commented: (PDFBOX-506) PDFBox can't parse PDF documents from jstor.org

Thomas Chojecki (JIRA) Fri, 27 Aug 2010 02:42:20 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903355#action_12903355 ]


Thomas Chojecki commented on PDFBOX-506:
----------------------------------------

Wiley Fuller, do you have an example file that you can attach?

I use incremental updates with PDFBox all the time and can't find any issue 
with the latest revision (updated a minute ago). 

> PDFBox can't parse PDF documents from jstor.org
> -----------------------------------------------
>
>                 Key: PDFBOX-506
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-506
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Dave Engberg
>         Attachments: siegel.pdf
>
>
> The academic repository JStor makes papers available via PDF format.  The 
> PDFs give this origin information:
>   Content creator:  JstorPdfGenerator v1.0
>   PDF producer:  iText 2.0.6 (by lowagie.com)
> These PDFs open fine in Acrobat, Preview, FoxIt, etc., but they throw an 
> exception in PDFBox:
> Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
>       at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
>       at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
>       at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
>       at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
>       at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)
> I traced through the code, and it appears that PDFBox rejects these because 
> they contain a 'startxref' that is not followed by a %%EOF two lines later:
> ...
> startxref
> 613364
> 1 0 obj
> ...
> Here's a small patch that will accept files that are missing the EOF after 
> the startxref:
> Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
> ===================================================================
> --- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java  (revision 802578)
> +++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java  (working copy)
> @@ -453,11 +453,9 @@
>              {  
>                  parseStartXref();
>                  //verify that EOF exists 
> -                String eof = readExpectedString( "%%EOF" );
> -                if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
> -                {
> -                    throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
> -                            " next=" +readString() );
> +                int c = pdfSource.peek();
> +                if (c == '%') {
> +                    readExpectedString("%%EOF");
>                  }
>                  isEndOfFile = true; 
>              }
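
The idea behind the quoted patch (read '%%EOF' only when the next character is '%', otherwise leave the input untouched) can be sketched outside of PDFBox as follows. This is a minimal illustration, not the PDFBox implementation: the class name LenientEofCheck and the method consumeOptionalEof are hypothetical, it uses java.io.PushbackReader in place of PDFBox's pdfSource, and skipping leading whitespace before the check is an assumption that the patch itself does not show.

```java
import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical helper, not part of PDFBox: mirrors the "peek, then read
// the marker only if it starts with '%'" logic of the quoted patch.
class LenientEofCheck {

    /**
     * Consumes an optional "%%EOF" marker.
     * Returns true if the marker was read, false if it is absent
     * (in which case the first non-whitespace character is pushed back).
     * Throws IOException if a '%' is found but "%%EOF" does not follow.
     */
    static boolean consumeOptionalEof(PushbackReader in) throws IOException {
        int c = in.read();
        // Assumption: whitespace between the xref offset and the marker is skipped here.
        while (c != -1 && Character.isWhitespace(c)) {
            c = in.read();
        }
        if (c == -1) {
            return false;            // end of input, no marker
        }
        if (c != '%') {
            in.unread(c);            // e.g. "1 0 obj" follows; leave it for the caller
            return false;
        }
        String rest = "%EOF";        // the first '%' has already been consumed
        for (int i = 0; i < rest.length(); i++) {
            int next = in.read();
            if (next != rest.charAt(i)) {
                throw new IOException("Expected '%%EOF' after startxref");
            }
        }
        return true;
    }
}
```

With input such as the jstor.org excerpt above, where "1 0 obj" directly follows the startxref offset, the method returns false and the caller can continue parsing the next object instead of failing.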

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
