[ https://issues.apache.org/jira/browse/PDFBOX-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056634#comment-13056634 ]
Adam Nichols commented on PDFBOX-1037:
--------------------------------------
Also, PDFBOX-1000 covers writing a conforming parser, which will parse the PDF
starting at the end of the file. As it is set up right now (bear in mind that
it's still far from finished), it throws an exception if the document is
non-conforming, with an error message that should direct you to the
problematic part of the PDF. There's also an option to just forge ahead and
try to parse it anyway. This will let people check PDF files for conformance,
or just attempt to parse whatever junk is thrown its way.
Since you said you are using PDFBox as a cleanup tool, this will probably
interest you. Try parsing with the conforming parser: if that works, you know
the PDF is in pristine condition. If it throws an exception, you know you have
a "dirty" PDF and can handle it as you see fit. The problem is that the parser
isn't done yet, but I plan on committing it after 1.6.0 is released to make it
easier for people to lend a hand. If you are eager to try it sooner, you can
check out the patches on JIRA.
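
To illustrate the "try strict first, fall back to lenient" workflow described
above, here is a minimal sketch. It does NOT use the PDFBOX-1000 parser (which
is unfinished and whose API is not shown here). The class StrictThenLenient,
its Result type, and the parse/load methods are all hypothetical, and the only
conformance rule this toy checks is that the "%PDF-" header sits at offset 0.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class StrictThenLenient {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Outcome of loading: where the header was found and whether strict parsing passed. */
    public static final class Result {
        public final int headerOffset;
        public final boolean pristine;
        public final String strictError; // null when the strict pass succeeded

        Result(int headerOffset, boolean pristine, String strictError) {
            this.headerOffset = headerOffset;
            this.pristine = pristine;
            this.strictError = strictError;
        }
    }

    /** Returns the index of the first occurrence of needle in data, or -1. */
    static int indexOf(byte[] data, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Hypothetical toy parser. Strict mode throws with the offending offset;
     * lenient mode forges ahead as long as a header exists somewhere.
     */
    static int parse(byte[] data, boolean lenient) throws IOException {
        int at = indexOf(data, HEADER);
        if (at == 0) {
            return 0;
        }
        if (at < 0) {
            throw new IOException("No %PDF- header found anywhere in the input");
        }
        if (!lenient) {
            throw new IOException("Non-conforming: %PDF- header found at offset "
                    + at + ", expected offset 0");
        }
        return at;
    }

    /** Strict pass first; on failure, remember why and retry leniently. */
    public static Result load(byte[] data) throws IOException {
        try {
            return new Result(parse(data, false), true, null);
        } catch (IOException strictFailure) {
            // The document is "dirty"; the caller decides what to do with that.
            return new Result(parse(data, true), false, strictFailure.getMessage());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] dirty = "junk\n%PDF-1.4\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        Result r = load(dirty);
        System.out.println("pristine=" + r.pristine + ", strictError=" + r.strictError);
    }
}
```

The caller gets both the parsed result and the reason the strict pass failed,
so a cleanup tool can log or quarantine dirty files instead of silently
accepting them.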
> PDF with multiple %%EOF only parses one page
> --------------------------------------------
>
> Key: PDFBOX-1037
> URL: https://issues.apache.org/jira/browse/PDFBOX-1037
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.5.0
> Environment: Windows XP - Java SE 1.6
> Reporter: Abraham Farris
> Attachments: blankpageproblemmod.pdf, blankpageproblemmod.png
>
>
> Any kind of page count (getDocumentCatalog().getPages().getCount()) returns
> only 1. A simple .load followed by .save strips out all pages after the
> first.