[jira] Commented: (PDFBOX-506) PDFBox can't parse PDF documents from jstor.org

Mel Martinez (JIRA) Fri, 27 Aug 2010 07:46:18 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903434#action_12903434 ]


Mel Martinez commented on PDFBOX-506:
-------------------------------------

This sounds very familiar - wasn't this fixed a few months back?

In fact, I think I may have suggested a patch for it. If only I weren't going 
senile and could remember more clearly...


> PDFBox can't parse PDF documents from jstor.org
> -----------------------------------------------
>
>                 Key: PDFBOX-506
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-506
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Dave Engberg
>         Attachments: siegel.pdf
>
>
> The academic repository JSTOR makes papers available in PDF format.  The 
> PDFs give this origin information:
>   Content creator:  JstorPdfGenerator v1.0
>   PDF producer:  iText 2.0.6 (by lowagie.com)
> These PDFs open fine in Acrobat, Preview, FoxIt, etc., but they throw an 
> exception in PDFBox:
> Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
>       at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
>       at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
>       at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
>       at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
>       at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
>       at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
>       at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)
> I traced through the code, and it appears that PDFBox rejects these because 
> they contain a 'startxref' that is not followed by a %%EOF two lines later:
> ...
> startxref
> 613364
> 1 0 obj
> ...
> Here's a small patch that will accept files that are missing the EOF after 
> the startxref:
> Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
> ===================================================================
> --- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java  (revision 802578)
> +++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java  (working copy)
> @@ -453,11 +453,9 @@
>              {  
>                  parseStartXref();
>                  //verify that EOF exists 
> -                String eof = readExpectedString( "%%EOF" );
> -                if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
> -                {
> -                    throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
> -                            " next=" +readString() );
> +                int c = pdfSource.peek();
> +                if (c == '%') {
> +                    readExpectedString("%%EOF");
>                  }
>                  isEndOfFile = true; 
>              }
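
The peek-then-read idea in the patch above can be sketched outside PDFBox 
roughly as follows. This is only an illustration, not PDFBox code: the class 
LenientEofCheck and the method skipOptionalEof are hypothetical names, the 
input is a plain byte array rather than PDFBox's pdfSource, and the 
whitespace skipping before the peek is an assumption about what happens 
after the startxref offset has been read.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; none of these names exist in PDFBox.
public class LenientEofCheck {
    private static final byte[] EOF_MARKER =
            "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /**
     * Starting at pos (just after the startxref offset), skips whitespace and
     * then consumes "%%EOF" only if the next byte is '%'. Returns the index of
     * the first byte that was not consumed.
     */
    public static int skipOptionalEof(byte[] data, int pos) {
        // Assumption: whitespace between the offset and the marker is skipped.
        while (pos < data.length && isPdfWhitespace(data[pos])) {
            pos++;
        }
        // "Peek": look at the next byte without committing to read a marker.
        if (pos < data.length && data[pos] == '%') {
            if (pos + EOF_MARKER.length > data.length) {
                throw new IllegalArgumentException(
                        "Truncated '%%EOF' at offset " + pos);
            }
            for (int i = 0; i < EOF_MARKER.length; i++) {
                if (data[pos + i] != EOF_MARKER[i]) {
                    throw new IllegalArgumentException(
                            "Expected '%%EOF' at offset " + pos);
                }
            }
            return pos + EOF_MARKER.length;
        }
        // No '%' follows: tolerate the missing marker and leave the data alone.
        return pos;
    }

    private static boolean isPdfWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n'
                || b == '\f' || b == 0;
    }

    public static void main(String[] args) {
        byte[] standard = "613364\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        byte[] jstorLike = "613364\n1 0 obj\n".getBytes(StandardCharsets.US_ASCII);
        // Index 6 is the first byte after the six-digit startxref offset.
        System.out.println(skipOptionalEof(standard, 6));
        System.out.println(skipOptionalEof(jstorLike, 6));
    }
}
```

As with the patch, a byte sequence that starts with '%' but is not '%%EOF' 
(an ordinary comment, for instance) is still rejected; only the case where 
the next token is something else, such as '1 0 obj', is tolerated.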

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
