[
https://issues.apache.org/jira/browse/PDFBOX-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280988#comment-13280988
]
Timo Boehme commented on PDFBOX-1098:
-------------------------------------
I was referring to the sequentially reading PDFParser. This parser can only use
the length information if it is a direct value in the stream dictionary. If it
is an indirect object, it will in most cases be placed after the stream (and
thus not be available at stream parsing time). In the rare case that it is
placed in front of the stream, it might not be the correct value, because the
object could be redefined later and the later definition selected by the
corresponding xref table entry. These problems are gone with the new parser.
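
To illustrate the distinction, here is a minimal sketch in plain Java. It does
not use the PDFBox API: IndirectRef, sequentialLength and xrefLength are
hypothetical names, and the map merely stands in for "the object the xref table
selects". It only shows why a sequential parser can trust a direct /Length
value but not an indirect one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

public class LengthLookupSketch {

    /** Hypothetical stand-in for an indirect reference such as "7 0 R". */
    static final class IndirectRef {
        final int objectNumber;

        IndirectRef(int objectNumber) {
            this.objectNumber = objectNumber;
        }
    }

    /** Sequential parsing: only a direct, non-negative integer can be trusted. */
    static OptionalInt sequentialLength(Object lengthEntry) {
        if (lengthEntry instanceof Integer && (Integer) lengthEntry >= 0) {
            return OptionalInt.of((Integer) lengthEntry);
        }
        // Indirect: the object is either not parsed yet or may be redefined later.
        return OptionalInt.empty();
    }

    /** Xref-based parsing: follow the reference to the object the xref selects. */
    static OptionalInt xrefLength(Object lengthEntry, Map<Integer, Integer> selectedByXref) {
        if (lengthEntry instanceof IndirectRef) {
            Integer value = selectedByXref.get(((IndirectRef) lengthEntry).objectNumber);
            return value == null ? OptionalInt.empty() : OptionalInt.of(value);
        }
        return sequentialLength(lengthEntry);
    }

    public static void main(String[] args) {
        // Assumed example: object 7 holds the length and the xref selects value 25.
        Map<Integer, Integer> selectedByXref = new HashMap<>();
        selectedByXref.put(7, 25);

        System.out.println(sequentialLength(42));
        System.out.println(sequentialLength(new IndirectRef(7)));
        System.out.println(xrefLength(new IndirectRef(7), selectedByXref));
    }
}
```

When no trusted length is available, a sequential parser has to fall back to
scanning for the "endstream" keyword.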
> Wrong implemented stream reader
> -------------------------------
>
> Key: PDFBOX-1098
> URL: https://issues.apache.org/jira/browse/PDFBOX-1098
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Reporter: Thomas Chojecki
> Assignee: Timo Boehme
> Priority: Critical
> Labels: ZLIB
> Fix For: 1.7.0
>
>
> The BaseParser#readUntilEndStream(OutputStream) method parses streams the
> wrong way. [1]
> This method reads a stream until the keyword "endstream" is reached and
> ignores the length value in the stream dictionary. This implementation breaks
> nearly every PDF document that has a PDF embedded inside a stream [2].
> The encoder used for compressing streams can be block-based (like
> FlateDecode, which is the most commonly used). If compressing a block of data
> does not save space, the encoder does not compress this block and marks it as
> uncompressed. So a stream can contain both compressed and uncompressed parts.
> If someone embeds PDF documents with streams inside a stream, the encoder
> will leave most parts of the document uncompressed. Such parts can contain
> plain text like "endstream" or other critical keywords that can cause the
> parser to stop.
> So we need to read the whole stream length that was written in the
> dictionary and not look for "endstream" keywords until the end is reached.
> The current stream parser causes a ZIPException with the message "Unexpected
> end of ZLIB input stream".
> A sample PDF and a patch are coming soon.
> [1] PDF 32000-1:2008 -> 7.3.8.2 Stream Extent
> [2] PDF 32000-1:2008 -> 7.11.4 Embedded File Streams
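
The failure mode described in the issue can be reproduced in miniature with
java.util.zip alone. The sketch below is not the PDFBox code: all class and
method names are hypothetical, and Deflater.NO_COMPRESSION is used only as an
assumption to force stored (uncompressed) blocks, so that the literal
"endstream" of an embedded object is guaranteed to appear in the encoded data.
It also signals truncated data with an IllegalStateException rather than the
exception named in the report.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class StreamExtentSketch {

    /** Deflates with level 0, so the data is emitted as stored blocks. */
    static byte[] deflateStored(byte[] data) {
        Deflater deflater = new Deflater(Deflater.NO_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates the data; throws IllegalStateException if it is truncated or corrupt. */
    static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                out.write(buf, 0, n);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalStateException("unexpected end of zlib input");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Bytes following the "stream" keyword: encoded data, then the real terminator. */
    static byte[] streamBody(byte[] encoded) {
        byte[] tail = "\nendstream\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] body = Arrays.copyOf(encoded, encoded.length + tail.length);
        System.arraycopy(tail, 0, body, encoded.length, tail.length);
        return body;
    }

    /** Keyword-based extent: everything before the first "endstream". */
    static byte[] readUntilKeyword(byte[] body) {
        // ISO-8859-1 maps bytes to chars one-to-one, so the index is a byte offset.
        int end = new String(body, StandardCharsets.ISO_8859_1).indexOf("endstream");
        return Arrays.copyOf(body, end < 0 ? body.length : end);
    }

    /** Length-based extent: exactly the number of bytes given by /Length. */
    static byte[] readByLength(byte[] body, int length) {
        return Arrays.copyOf(body, length);
    }

    public static void main(String[] args) {
        byte[] embedded = "1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] encoded = deflateStored(embedded);
        byte[] body = streamBody(encoded);

        byte[] byLength = readByLength(body, encoded.length);
        System.out.println("by length decodes: " + Arrays.equals(inflate(byLength), embedded));

        byte[] byKeyword = readUntilKeyword(body);
        System.out.println("keyword scan stopped early: " + (byKeyword.length < encoded.length));
    }
}
```

Reading by length returns the complete encoded data, while the keyword scan
stops at the "endstream" that belongs to the embedded object and hands a
truncated buffer to the decoder.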