[jira] [Commented] (PDFBOX-1037) PDF with multiple %%EOF only parses one page

Adam Nichols (JIRA) Tue, 28 Jun 2011 10:03:42 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056634#comment-13056634
 ]


Adam Nichols commented on PDFBOX-1037:
--------------------------------------

Also, PDFBOX-1000 tracks writing a conforming parser, which parses the PDF 
starting at the end of the file.  As it is set up right now (bear in mind 
that it's still far from finished), it throws an exception if the document 
is non-conforming, with an error message that should direct you to the 
problematic part of the PDF.  There's also an option to just forge ahead and 
try to parse the file anyway.  This lets people check PDF files for 
conformity, or just attempt to parse whatever junk is thrown their way.  
Since you said you are using PDFBox as a cleanup tool, this will probably 
interest you: try parsing with the conforming parser first.  If that works, 
you know the PDF is in pristine condition.  If it throws an exception, you 
know you have a "dirty" PDF and can handle it as you see fit.  The catch is 
that the parser isn't done yet, but I plan on committing it after 1.6.0 is 
released to make it easier for people to lend a hand.  If you are eager to 
try it sooner, you can check out the patches on JIRA.
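
To make the "start at the end of the file" and "strict vs. forge ahead" 
ideas concrete, here is a rough sketch in plain Java.  It is not the 
PDFBOX-1000 parser and uses no PDFBox API: the class name TailScanner, the 
method findStartXref and the strict flag are made up for illustration.  It 
only relies on the fact that a PDF ends with "startxref", a byte offset, and 
"%%EOF".  The strict rule used here (nothing but whitespace after the final 
%%EOF) is an assumption for the example, not a statement of what the real 
parser checks.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of PDFBox): reads the trailer from the end of a PDF. */
final class TailScanner {

    /**
     * Returns the number written after the last "startxref" keyword.
     * strict = true : throw an IOException describing where the file looks wrong.
     * strict = false: forge ahead and return a value whenever one can be found.
     */
    static long findStartXref(byte[] pdf, boolean strict) throws IOException {
        // ISO-8859-1 maps every byte to one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);

        // Work backwards: the last %%EOF marks the end of the newest trailer.
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            if (strict) {
                throw new IOException("no %%EOF marker found");
            }
            eof = text.length(); // lenient: act as if the marker sat at the very end
        } else if (strict && !text.substring(eof + 5).trim().isEmpty()) {
            // Assumed strict rule: only whitespace may follow the final %%EOF.
            throw new IOException("unexpected data after %%EOF at offset " + (eof + 5));
        }

        // The keyword closest to the end wins, even if earlier trailers exist.
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IOException("no startxref keyword before offset " + eof);
        }

        String number = text.substring(keyword + 9, eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            throw new IOException("invalid startxref value at offset " + keyword);
        }
    }
}
```

A cleanup tool could call findStartXref(bytes, true) first and treat an 
IOException as the sign of a "dirty" file, then retry with 
findStartXref(bytes, false).  Note that a file with several %%EOF markers is 
not rejected by this sketch, since incremental updates legitimately append 
extra trailers; it simply reads the last one.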

> PDF with multiple %%EOF only parses one page
> --------------------------------------------
>
>                 Key: PDFBOX-1037
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1037
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.5.0
>         Environment: Windows XP - Java SE 1.6
>            Reporter: Abraham Farris
>         Attachments: blankpageproblemmod.pdf, blankpageproblemmod.png
>
>
> Any type of page count (getDocumentCatalog().getPages().getCount()) only 
> returns 1.  Doing a simple .load and .save will strip out all pages after 
> the first.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
