Wrong implemented stream reader
-------------------------------

                 Key: PDFBOX-1098
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1098
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
            Reporter: Thomas Chojecki
            Priority: Critical


The BaseParser#readUntilEndStream(OutputStream) method parses streams 
incorrectly. [1]

This method reads a stream until the keyword "endstream" is reached and 
ignores the Length value in the stream dictionary. This implementation breaks 
nearly every PDF document that has another PDF embedded inside a stream [2].

The encoders used for compressing streams can be block-based (like 
FlateDecode, which is the most commonly used). If compressing a block of data 
would not save space, the encoder does not compress this block and marks it as 
uncompressed. So a stream can contain both compressed and uncompressed parts. 
If someone embeds PDF documents, which contain streams of their own, inside a 
stream, the encoder will leave most parts of the embedded document 
uncompressed. Such parts can contain plain text like "endstream" or other 
critical keywords that can cause the parser to stop.
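
The effect described above can be reproduced with a minimal sketch using 
java.util.zip.Deflater. The class and method names (StoredBlockDemo, deflate, 
containsKeyword) are hypothetical and not part of PDFBox. To keep the result 
deterministic, the sketch forces uncompressed ("stored") blocks with 
Deflater.NO_COMPRESSION; this is an assumption made for illustration, since a 
real encoder makes that choice per block depending on the data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class StoredBlockDemo {

    /** Deflates the input at the given compression level and returns the zlib bytes. */
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns true if the raw bytes contain the literal keyword "endstream". */
    static boolean containsKeyword(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        return new String(data, StandardCharsets.ISO_8859_1).contains("endstream");
    }

    public static void main(String[] args) {
        // Stand-in for an embedded PDF fragment; not taken from a real file.
        byte[] inner = "stream\nabc\nendstream\nendobj"
                .getBytes(StandardCharsets.ISO_8859_1);
        // Level 0 forces stored blocks (illustration only, see the note above).
        byte[] encoded = deflate(inner, Deflater.NO_COMPRESSION);
        System.out.println("keyword visible in encoded data: "
                + containsKeyword(encoded));
    }
}
```

A parser that scans the encoded bytes for "endstream" would stop inside this 
data, which is the failure mode reported here.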

So we need to read the full stream length given in the dictionary and ignore 
any "endstream" keywords until that many bytes have been read.
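
A rough sketch of this length-based approach follows. It is not the PDFBox 
implementation and not the announced patch: LengthBasedStreamReader, 
copyExactly, expectEndStream and readStreamBody are hypothetical names, and 
the sketch assumes that the Length value has already been resolved to a 
correct, direct number. A real parser would also have to handle an indirect 
or wrong Length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class LengthBasedStreamReader {

    private static final byte[] KEYWORD =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    /** Copies exactly 'length' bytes from in to out without inspecting them. */
    static void copyExactly(InputStream in, OutputStream out, long length)
            throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = length;
        while (remaining > 0) {
            int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) {
                throw new EOFException("Stream ended " + remaining + " bytes early");
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
    }

    /** Skips an optional end-of-line marker, then requires "endstream". */
    static void expectEndStream(InputStream in) throws IOException {
        int c = in.read();
        if (c == '\r') {
            c = in.read();
        }
        if (c == '\n') {
            c = in.read();
        }
        // c now holds the first byte that should belong to the keyword.
        for (int i = 0; i < KEYWORD.length; i++) {
            if (i > 0) {
                c = in.read();
            }
            if (c != KEYWORD[i]) {
                throw new IOException("Expected 'endstream' after stream data");
            }
        }
    }

    /** Reads a stream body of the given length and checks the closing keyword. */
    static byte[] readStreamBody(InputStream in, long length) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        copyExactly(in, body, length);
        expectEndStream(in);
        return body.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The body itself contains the keyword; only the length decides the end.
        String body = "BT (endstream) Tj ET";
        byte[] data = (body + "\nendstream\nendobj")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] read = readStreamBody(new ByteArrayInputStream(data), body.length());
        System.out.println(new String(read, StandardCharsets.US_ASCII));
    }
}
```

The keyword is only checked after the declared number of bytes has been 
consumed, so an "endstream" inside the data no longer ends the stream early.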

The current stream parser causes a ZIPException with the message "Unexpected 
end of ZLIB input stream".

A sample PDF and a patch will follow soon.

[1] PDF 32000-1:2008 -> 7.3.8.2 Stream Extent
[2] PDF 32000-1:2008 -> 7.11.4 Embedded File Streams

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
