[ https://issues.apache.org/jira/browse/PDFBOX-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056634#comment-13056634 ]

Adam Nichols commented on PDFBOX-1037:
--------------------------------------

Also, PDFBOX-1000 is to write a conforming parser which will parse the PDF starting at the end of the file. The way it is set up right now (bear in mind that it's still far from finished), it will throw an exception if the document is non-conforming and give you an error message which should direct you to the part of the PDF that is problematic. There's also an option to just forge ahead and try to parse it anyway. This will allow people to check PDF files for conformity, or just attempt to parse whatever junk is thrown its way.

Since you said you are using PDFBox as a clean-up tool, this will probably be something you'll be interested in. Try parsing with the conforming parser: if that works, you know the PDF is in pristine condition. If it throws an exception, you know you have a "dirty" PDF and you can handle that as you see fit.

The problem is that the parser isn't done yet, but I plan on committing it after 1.6.0 is released to make it easier for people to lend a hand. If you are anxious to try it before then, you can look at the patches on JIRA.

> PDF with multiple %%EOF only parses one page
> --------------------------------------------
>
>                 Key: PDFBOX-1037
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1037
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.5.0
>         Environment: Windows XP - Java SE 1.6
>            Reporter: Abraham Farris
>         Attachments: blankpageproblemmod.pdf, blankpageproblemmod.png
>
>
> Any type of page count (getDocumentCatalog().getPages().getCount()) only
> returns int 1. Doing a simple .load and .save will strip out all pages after
> the first.
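
Since the issue concerns a PDF containing more than one %%EOF marker, a quick way to spot such files before handing them to a parser is to count the markers in the raw bytes. The following is only a minimal sketch using the Java standard library. The class EofMarkerCounter and its method countEofMarkers are hypothetical names made up for illustration and are not part of PDFBox. A count above 1 does not by itself mean the file is broken, because incrementally updated PDFs legitimately contain several %%EOF markers. The marker bytes could also occur inside stream data, so treat the result as a hint only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
final class EofMarkerCounter {

    // The end-of-file marker as raw ASCII bytes.
    private static final byte[] MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    private EofMarkerCounter() {
    }

    /** Counts non-overlapping occurrences of the bytes "%%EOF" in data. */
    static int countEofMarkers(byte[] data) {
        int count = 0;
        int i = 0;
        while (i + MARKER.length <= data.length) {
            if (matchesAt(data, i)) {
                count++;
                i += MARKER.length; // skip past the marker just found
            } else {
                i++;
            }
        }
        return count;
    }

    // Returns true if MARKER appears in data starting at offset.
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < MARKER.length; j++) {
            if (data[offset + j] != MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

To try it on a real file, the bytes can be read with java.nio.file.Files.readAllBytes (Java 7 or later) and passed to countEofMarkers. A result greater than 1 suggests the file has appended sections that a parser reading only the first section would miss.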