[
https://issues.apache.org/jira/browse/PDFBOX-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280976#comment-13280976
]
Thomas Chojecki commented on PDFBOX-1098:
-----------------------------------------
I haven't taken a look at the NonSequentialPDFParser yet, but without a
length there is no way to read FlateDecoded PDF attachments like the one in
PDFBOX-1299. Looks like a design issue in PDF. :-)
> Wrong implemented stream reader
> -------------------------------
>
> Key: PDFBOX-1098
> URL: https://issues.apache.org/jira/browse/PDFBOX-1098
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Reporter: Thomas Chojecki
> Assignee: Timo Boehme
> Priority: Critical
> Labels: ZLIB
> Fix For: 1.7.0
>
>
> The BaseParser#readUntilEndStream(OutputStream) method parses streams the
> wrong way. [1]
> It reads a stream until the keyword "endstream" is reached and ignores the
> length value in the stream dictionary. This implementation breaks nearly
> every PDF document that has a PDF embedded inside a stream [2].
> The encoder used for compressing streams can be block-based (like
> FlateDecode, which is the most common). If compressing a block of data does
> not save space, the encoder leaves the block uncompressed and marks it as
> such, so a stream can contain both compressed and uncompressed parts. If
> someone embeds PDF documents that contain streams inside a stream, the
> encoder will leave most parts of the document uncompressed. Such parts can
> contain plain text like "endstream" or other critical keywords that can
> cause the parser to stop.
> So we need to read the whole stream length that was written in the
> dictionary and not look for "endstream" keywords until that end is reached.
> The current stream parser causes a ZIPException with the message "Unexpected
> end of ZLIB input stream".
> A sample PDF and a patch are coming soon.
> [1] PDF 32000-1:2008 -> 7.3.8.2 Stream Extent
> [2] PDF 32000-1:2008 -> 7.11.4 Embedded File Streams
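
The difference between the two ways of finding the stream extent described above can be sketched in Java roughly as follows. This is only an illustration, not the PDFBox code or the upcoming patch: the class StreamExtentDemo and all of its methods are hypothetical names, and Deflater.NO_COMPRESSION is used as an assumption to force stored (uncompressed) blocks, standing in for blocks the encoder could not shrink.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; none of these names come from PDFBox.
final class StreamExtentDemo {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    // Assumption: NO_COMPRESSION makes zlib emit stored blocks, so the
    // input bytes appear verbatim inside the encoded stream.
    static byte[] deflateStored(byte[] data) {
        Deflater deflater = new Deflater(Deflater.NO_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Stream body as it would sit in the file: encoded bytes, EOL, "endstream".
    static byte[] withEndstream(byte[] encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(encoded, 0, encoded.length);
        out.write('\n');
        out.write(ENDSTREAM, 0, ENDSTREAM.length);
        return out.toByteArray();
    }

    // Index of the first occurrence of needle at or after from, or -1.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Keyword-based extent: stop at the first "endstream", ignoring the length.
    static byte[] readUntilKeyword(byte[] body) {
        int end = indexOf(body, ENDSTREAM, 0);
        return Arrays.copyOf(body, end < 0 ? body.length : end);
    }

    // Length-based extent: take exactly `length` bytes, then require
    // "endstream" after an optional end-of-line marker.
    static byte[] readByLength(byte[] body, int length) {
        if (length < 0 || length > body.length) {
            throw new IllegalArgumentException("bad length: " + length);
        }
        int pos = length;
        while (pos < body.length && (body[pos] == '\r' || body[pos] == '\n')) {
            pos++;
        }
        if (indexOf(body, ENDSTREAM, pos) != pos) {
            throw new IllegalStateException("no endstream after " + length + " bytes");
        }
        return Arrays.copyOf(body, length);
    }

    // True if the bytes can be inflated to the end without an I/O error.
    static boolean inflatesCleanly(byte[] encoded) {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(encoded))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // discard the decoded bytes, only success matters here
            }
            return true;
        } catch (IOException e) {
            return false; // truncated or corrupt input
        }
    }

    public static void main(String[] args) {
        // An embedded object whose own "endstream" survives encoding verbatim.
        byte[] embedded = "1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] encoded = deflateStored(embedded);
        byte[] body = withEndstream(encoded);
        System.out.println("keyword-based ok: " + inflatesCleanly(readUntilKeyword(body)));
        System.out.println("length-based ok:  " + inflatesCleanly(readByLength(body, encoded.length)));
    }
}
```

In this sketch the keyword-based read stops at the inner "endstream" and hands a truncated stream to the inflater, while the length-based read returns the complete encoded data. A real parser would presumably still need a fallback for files whose length entry is wrong or missing, which this sketch does not attempt.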