[ 
https://issues.apache.org/jira/browse/PDFBOX-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686712#comment-13686712
 ] 

Timo Boehme commented on PDFBOX-1641:
-------------------------------------

It is common for PDF documents to contain garbage between referenced objects. 
This is only a problem for the sequential parser, which reads the file from 
start to end. If one uses the NonSequentialPDFParser (via 
PDDocument.loadNonSeq()), only the referenced objects are read, so the garbage 
is never touched and causes no problem. Please give it a test.
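
To illustrate why jumping to referenced objects sidesteps the garbage, here is a 
minimal, self-contained sketch. It is not PDFBox code: the class XrefSketch, its 
methods, and the toy data are all hypothetical, and the offset table is computed 
by hand where a real parser would take it from the xref table.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy model, not PDFBox internals.
class XrefSketch {
    // Two well-formed objects with a dangling ">> endobj" between them.
    static final String DATA =
        "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        + ">>\nendobj\n"
        + "5 0 obj\n<< /Type /Pages >>\nendobj\n";

    private static final Pattern HEADER = Pattern.compile("(\\d+) (\\d+) obj\\n");

    // Reads one object body, requiring an object header exactly at the offset.
    static String readObjectAt(String data, int offset) throws IOException {
        Matcher m = HEADER.matcher(data);
        m.region(offset, data.length());
        if (!m.lookingAt()) {
            throw new IOException("expected object header at offset " + offset);
        }
        int end = data.indexOf("endobj", m.end());
        if (end < 0) {
            throw new IOException("missing endobj after offset " + offset);
        }
        return data.substring(m.end(), end).trim();
    }

    // Sequential strategy: every byte run after an endobj must be an object.
    static List<String> readSequentially(String data) throws IOException {
        List<String> bodies = new ArrayList<>();
        int pos = 0;
        while (pos < data.length()) {
            bodies.add(readObjectAt(data, pos));
            pos = data.indexOf("endobj", pos) + "endobj\n".length();
        }
        return bodies;
    }

    // Stand-in for an xref table: object number -> byte offset.
    static Map<Integer, Integer> offsets() {
        Map<Integer, Integer> table = new LinkedHashMap<>();
        table.put(1, DATA.indexOf("1 0 obj"));
        table.put(5, DATA.indexOf("5 0 obj"));
        return table;
    }

    // Non-sequential strategy: visit only the offsets listed in the table.
    static Map<Integer, String> readByXref(String data, Map<Integer, Integer> table)
            throws IOException {
        Map<Integer, String> objects = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : table.entrySet()) {
            objects.put(e.getKey(), readObjectAt(data, e.getValue()));
        }
        return objects;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("by xref: " + readByXref(DATA, offsets()));
        try {
            readSequentially(DATA);
        } catch (IOException ex) {
            System.out.println("sequential read failed: " + ex.getMessage());
        }
    }
}
```

With PDFBox 1.8.x itself, the call to try would be along the lines of 
PDDocument.loadNonSeq(new File("sample.pdf"), null), assuming the overload that 
takes a File and an optional scratch RandomAccess; the file name here is a 
placeholder.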

I'm not sure whether the base parser should be changed, since it is acceptable 
for it to throw an IOException if an object cannot be read.
                
> Parsing of PDFs fails when no '<<' between directory objects
> ------------------------------------------------------------
>
>                 Key: PDFBOX-1641
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1641
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.2, 2.0.0
>            Reporter: Niko Ojanen
>         Attachments: pdfbox-1641_patch.txt, pdfbox-1641_sample.pdf
>
>
> PDFs that are missing {{<<}} between two COS directory objects fail to 
> parse. 
> E.g.
> {noformat}
> >>
> endobj
> 2 0 obj
> >>
> endobj
> 5 0 obj
> <<
> {noformat}
> The fix for handling these situations is adding under 20 lines of code to 
> {{BaseParser.java}} (which I'd like to contribute).

