[
https://issues.apache.org/jira/browse/PDFBOX-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914187#comment-13914187
]
Tilman Hausherr commented on PDFBOX-908:
----------------------------------------
Could you please try again with the copyrighted files, not the test files, and
loadNonSeq() with the 2nd parameter being null?
> Gracefully handle corrupt PDFs
> ------------------------------
>
> Key: PDFBOX-908
> URL: https://issues.apache.org/jira/browse/PDFBOX-908
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.3.1
> Reporter: Martijn Brinkers
> Attachments: PDFBOX-908.patch, test-corrupt-R.pdf,
> test-integer-too-large.pdf, test-obj-missing-bj.pdf,
> test-stream-missing-endobj.pdf
>
>
> I will use PDFBox for text extraction and one of the main requirements is
> that it should extract as much text as possible. If the PDF document contains
> something that isn't strictly correct according to the PDF specs it should
> try to recover gracefully and continue scanning if possible when forceParsing
> is enabled. While testing against a large batch of PDF documents (including
> large ebooks) I found that the parser sometimes stops parsing and/or
> extracting text even with forceParsing enabled. I have attached a patch to
> make PDFBox handle some PDF problems more gracefully when forceParsing is
> enabled.
> Some of my patches try to handle certain situations differently from the
> existing code. For example the existing code to handle cases when an endobj
> is missing seems to be very complex. In all of my tests it seems to work
> better when the code just assumes that the endobj was missing. Whether simply
> assuming that endobj is missing is better than the existing way of coping
> with this is of course debatable.
> A patch is included to handle situations where the data (ID) for an inline
> image contains the EI keyword. The EI is now only accepted if the char before
> EI is an end-of-line marker instead of whitespace.
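
To illustrate the EI rule described above, here is a minimal sketch in plain
Java. It is not the code from PDFBOX-908.patch: the class InlineImageEndFinder
and its methods are hypothetical names, and it assumes the image data is
already in a byte array. Binary image data can still contain an end-of-line
byte followed by "EI" by chance, so this check only lowers the risk of a false
match.

```java
/** Hypothetical helper, not part of PDFBox: finds the end of inline image data. */
final class InlineImageEndFinder {

    private InlineImageEndFinder() {
    }

    /** Treats carriage return and line feed as end-of-line markers. */
    static boolean isEndOfLine(int b) {
        return b == '\r' || b == '\n';
    }

    /**
     * Returns the index of the 'E' of the first "EI" that directly follows an
     * end-of-line marker, searching from start, or -1 if there is none.
     * An "EI" preceded by a space or any other byte is treated as image data.
     */
    static int findEndOfInlineImage(byte[] data, int start) {
        // Begin at index 1 at the earliest so that data[i - 1] always exists.
        for (int i = Math.max(start, 1); i + 1 < data.length; i++) {
            if (data[i] == 'E' && data[i + 1] == 'I' && isEndOfLine(data[i - 1])) {
                return i;
            }
        }
        return -1;
    }
}
```

With the bytes of "ID\nab EI cd\nEI\nQ", the "EI" after the space is skipped
and the one after the line feed is reported as the end of the image.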
> I have added the method #isContinueOnError to PDFParser. By default it
> returns forceParsing but implementors can override it to stop parsing when a
> certain limit is reached (for example on a timeout). This can be helpful to
> stop parsing when the parser gets stuck in an infinite loop.
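
The override idea can be sketched as follows. SketchParser is a stand-in
written for this example, not the real PDFParser, and the no-argument
signature of isContinueOnError is an assumption, since the description gives
only the method name. TimeLimitedParser and countValid are hypothetical as
well.

```java
import java.util.List;

/** Stand-in for the patched parser; NOT org.apache.pdfbox.pdfparser.PDFParser. */
class SketchParser {
    private final boolean forceParsing;

    SketchParser(boolean forceParsing) {
        this.forceParsing = forceParsing;
    }

    /** Default policy from the description: continue only if forceParsing is set. */
    protected boolean isContinueOnError() {
        return forceParsing;
    }

    /** Counts tokens that parse as ints; a bad token is skipped or rethrown. */
    int countValid(List<String> tokens) {
        int valid = 0;
        for (String token : tokens) {
            try {
                Integer.parseInt(token);
                valid++;
            } catch (NumberFormatException e) {
                if (!isContinueOnError()) {
                    throw e;
                }
            }
        }
        return valid;
    }
}

/** Hypothetical subclass that stops tolerating errors once a time budget is used up. */
class TimeLimitedParser extends SketchParser {
    private final long deadlineNanos;

    TimeLimitedParser(long budgetMillis) {
        super(true);
        this.deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
    }

    @Override
    protected boolean isContinueOnError() {
        // Compare by subtraction, as recommended for System.nanoTime() values.
        return super.isContinueOnError() && System.nanoTime() - deadlineNanos < 0;
    }
}
```

A parser that is stuck keeps hitting errors, so checking the deadline on the
error path is enough to end the loop in this sketch. A loop that spins without
raising errors would need a check elsewhere.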
> BaseParser#readInt unread the data when a NumberFormatException was thrown.
> This resulted in an infinite loop when forceParsing was enabled when testing
> with test-integer-too-large.pdf (see attached file). I think it's better not
> to unread data when an exception will be thrown because the risk of running
> into an infinite loop is higher.
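
The readInt point can be shown with a small stand-alone sketch built on
java.io.PushbackInputStream. ReadIntSketch, readInt and sumLenient are
hypothetical names and this is not the BaseParser code. The assumption is a
lenient caller that skips a failed read and carries on: if the rejected digits
were pushed back, that caller would read the same oversized number forever.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch, not BaseParser: an int reader that never unreads on failure. */
final class ReadIntSketch {

    private ReadIntSketch() {
    }

    /** Reads a run of ASCII digits and parses it; the digits stay consumed on failure. */
    static int readInt(PushbackInputStream in) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c >= '0' && c <= '9') {
            digits.append((char) c);
        }
        if (c != -1) {
            in.unread(c); // push back only the single byte that ended the run
        }
        try {
            return Integer.parseInt(digits.toString());
        } catch (NumberFormatException e) {
            // Deliberately no unread of the digits: the stream has moved past them.
            throw new IOException("Not a valid int: '" + digits + "'", e);
        }
    }

    /** Lenient loop: sums every int it can read and skips what it cannot. */
    static long sumLenient(String text) {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1)));
        long sum = 0;
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    in.unread(c);
                    try {
                        sum += readInt(in);
                    } catch (IOException e) {
                        // Skip the bad number; progress is guaranteed because
                        // readInt left its digits consumed.
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }
}
```

For the input "12 99999999999999999999 30" the oversized number is skipped
once and the loop ends with the sum of 12 and 30.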
> The other patches are minor, such as checks for null values.
> I have attached four test PDF documents. I corrupted these PDFs by hand to
> replicate situations similar to those I found in existing (copyrighted)
> ebooks.