[ 
https://issues.apache.org/jira/browse/PDFBOX-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Olsen updated PDFBOX-2016:
---------------------------------

    Attachment: Hello_broken.pdf
                Hello.pdf

One of these PDFs is corrupt (a stream has an incorrect length value), but 
most PDF viewers tolerate this and render both files identically. In PDFBox, 
all text in the corrupt file is lost.

> Stream parsing still incorrect if length value is wrong
> -------------------------------------------------------
>
>                 Key: PDFBOX-2016
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2016
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0, 1.8.4
>            Reporter: Andrew Olsen
>         Attachments: Hello.pdf, Hello_broken.pdf
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> From issue PDFBOX-1333 - "In 1.7.0 stream parsing in BaseParser was optimized 
> to use length value if available. The advantage is faster parsing and 
> independence of 'endstream' bytes sequences in stream. However the 
> disadvantage is that streams with wrong length values cannot be parsed 
> anymore" - etc. 
> This issue was marked as fixed now that COSStreams can once again be parsed 
> by reading all the way to 'endstream'. However, the resulting COSStream 
> object still carries the expected (declared) length, not the true length. 
> When the COSStream is parsed with a PDFStreamParser, the call to 
> COSStream#getUnfilteredStream uses getLength() instead of 
> getLengthWritten() to limit the amount of data that can be read. This can 
> truncate the stream, so incorrect length values still lead to missing data, 
> which limits the usefulness of the last fix. Changing the call to use 
> getLengthWritten() should solve the problem.
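
The truncation described in the issue can be illustrated with a small 
standalone sketch. This is not PDFBox code: the class StreamLengthSketch, the 
helper readBounded, and the variables declaredLength and lengthWritten are 
hypothetical names used only to mirror the difference between the declared 
length and the number of bytes actually written. It assumes the declared 
length is smaller than the real stream size.

```java
import java.nio.charset.StandardCharsets;

public class StreamLengthSketch {

    /**
     * Returns the part of the stored data that a reader sees when its view
     * is capped at `limit` bytes (hypothetical stand-in for a length-bounded
     * input stream over the stored stream data).
     */
    static String readBounded(byte[] stored, long limit) {
        int n = (int) Math.max(0, Math.min(limit, stored.length));
        return new String(stored, 0, n, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Bytes recovered by reading all the way to 'endstream'.
        byte[] stored = "BT /F1 12 Tf (Hello World) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);

        long declaredLength = 10;           // assumed wrong length from the file
        long lengthWritten = stored.length; // bytes actually stored

        // Capping by the declared length cuts the content stream short,
        // so the text-showing operator is never seen.
        System.out.println("declared: " + readBounded(stored, declaredLength));

        // Capping by the number of bytes written yields the full stream.
        System.out.println("written:  " + readBounded(stored, lengthWritten));
    }
}
```

With the declared length as the cap, the reader never reaches the operator 
that shows the text, which matches the symptom of all text being lost.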



--
This message was sent by Atlassian JIRA
(v6.2#6252)
