[jira] [Created] (PDFBOX-4973) Parsing truncated files no longer throws IOException: Error reading stream, expected='endstream' actual='' at offset ...

Plamen Penchev (Jira) Tue, 29 Sep 2020 04:56:49 -0700

Plamen Penchev created PDFBOX-4973:
--------------------------------------

             Summary: Parsing truncated files no longer throws IOException: 
Error reading stream, expected='endstream' actual='' at offset ...
                 Key: PDFBOX-4973
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4973
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 2.0.21, 2.0.20, 2.0.19, 2.0.18, 2.0.17, 2.0.16, 2.0.15, 
2.0.14, 2.0.13, 2.0.12, 2.0.11, 2.0.10, 2.0.9, 2.0.8, 2.0.7
         Environment: Ubuntu 16.04
            Reporter: Plamen Penchev
         Attachments: truncated-with-eof.pdf, truncated.pdf


h3. Issue:

Since 2.0.7, no exception is thrown when a stream of a truncated PDF file is 
parsed.

In 2.0.6 *COSParser's parseCOSStream* throws *"java.io.IOException: Error 
reading stream, expected='endstream' actual='' at offset ..."*. In 2.0.7 and 
later, parsing succeeds instead. If an EOF marker is added to the truncated 
file, however, the expected exception is thrown once again.
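
To make the failure mode concrete, here is a toy sketch of a strict stream reader that fails on truncated input. It is not PDFBox's COSParser code: the class {{ToyStreamReader}} and its methods are hypothetical, and only the wording of the exception message is taken from the 2.0.6 behavior described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of PDFBox or Tika.
final class ToyStreamReader {

    // Reads `length` body bytes, then requires the keyword "endstream".
    static byte[] readStreamBody(InputStream in, int length) throws IOException {
        byte[] body = new byte[length];
        int read = 0;
        while (read < length) {
            int n = in.read(body, read, length - read);
            if (n < 0) {
                break; // input ended before the declared length was reached
            }
            read += n;
        }
        String keyword = readKeyword(in);
        if (!"endstream".equals(keyword)) {
            // On truncated input the keyword is empty, hence actual=''.
            throw new IOException("Error reading stream, expected='endstream' actual='"
                    + keyword + "'");
        }
        return body;
    }

    // Skips leading CR, LF and spaces, then reads up to the next one or end of input.
    private static String readKeyword(InputStream in) throws IOException {
        int c = in.read();
        while (c == '\r' || c == '\n' || c == ' ') {
            c = in.read();
        }
        StringBuilder sb = new StringBuilder();
        while (c >= 0 && c != '\r' && c != '\n' && c != ' ') {
            sb.append((char) c);
            c = in.read();
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Declared length is 5, but the second input ends after 3 bytes.
        String[] inputs = {"hello\nendstream\n", "hel"};
        for (String input : inputs) {
            byte[] bytes = input.getBytes(StandardCharsets.ISO_8859_1);
            try {
                readStreamBody(new ByteArrayInputStream(bytes), 5);
                System.out.println("parsed without error");
            } catch (IOException e) {
                System.out.println("failed: " + e.getMessage());
            }
        }
    }
}
```

A lenient reader would return whatever bytes it managed to read instead of throwing, which matches the behavior observed for the truncated file from 2.0.7 onwards.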

The code below is the minimum setup for reproducing the behavior (_in 
conjunction with the respective file attached_):
{code:java}
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.IOException;

public class Main {

        public static void main(String[] args) {

                File inputFile = new File("/path/to/parent/folder", "truncated.pdf");

                try {
                        // metadata will be extracted by Tika
                        Metadata meta = new Metadata();
                        meta.set(Metadata.CONTENT_TYPE, "application/pdf");

                        BodyContentHandler ch = new BodyContentHandler(-1);

                        AutoDetectParser parser = new AutoDetectParser();

                        PDFParserConfig pdfParserConfig = new PDFParserConfig();
                        pdfParserConfig.setOcrStrategy("no_ocr");
                        pdfParserConfig.setMaxMainMemoryBytes(209715200);

                        ParseContext parseContext = new ParseContext();
                        parseContext.set(PDFParserConfig.class, pdfParserConfig);

                        try (TikaInputStream is = TikaInputStream.get(inputFile.toPath())) {
                                // try to parse the document
                                parser.parse(is, ch, meta, parseContext);
                        }

                } catch (TikaException | SAXException | IOException ex) {

                        // expect to enter catch
                } finally {

                        // instead catch is skipped
                }
        }
}

{code}
The stack looks like this:
||Method||Class||Library||
|parseCOSStream|COSParser|(pdfbox)|
|parseFileObject|COSParser|(pdfbox)|
|parseObjectDynamically|COSParser|(pdfbox)|
|parseDictObjects|COSParser|(pdfbox)|
|initialParse|PDFParser|(pdfbox)|
|parse|PDFParser|(pdfbox)|
|load|PDDocument|(pdfbox)|
|parse|PDFParser|(tika-parsers)|
|parse|CompositeParser|(tika-parsers)|

In 2.0.6 the IOException thrown in parseCOSStream is caught in Tika's 
CompositeParser parse method and rethrown as a TikaException, which we expect 
and handle in the sample code provided.
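
The catch-and-rethrow pattern described here can be sketched with the standard library alone. Every name in this sketch ({{RethrowSketch}}, {{WrappedParseException}}, {{parseLowLevel}}) is a hypothetical stand-in and not Tika or PDFBox API. It only shows how a low-level IOException reaches the caller as a wrapped checked exception with the original cause attached.

```java
import java.io.IOException;

// Hypothetical illustration only; not part of PDFBox or Tika.
final class RethrowSketch {

    // Stand-in for a library-level checked exception such as TikaException.
    static final class WrappedParseException extends Exception {
        WrappedParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Stand-in for the low-level parser: fails only for truncated input.
    static void parseLowLevel(boolean truncated) throws IOException {
        if (truncated) {
            throw new IOException("Error reading stream, expected='endstream' actual=''");
        }
    }

    // Stand-in for the outer parser: wraps the IOException and rethrows it.
    static void parse(boolean truncated) throws WrappedParseException {
        try {
            parseLowLevel(truncated);
        } catch (IOException e) {
            throw new WrappedParseException("Unable to parse document", e);
        }
    }

    public static void main(String[] args) {
        try {
            parse(true);
            System.out.println("parsed without error");
        } catch (WrappedParseException e) {
            System.out.println("caught: " + e.getMessage()
                    + " (cause: " + e.getCause().getMessage() + ")");
        }
    }
}
```

If the low-level parser stops throwing, as reported for 2.0.7 and later, the outer catch block is never reached and the caller has no signal that the input was truncated.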
h3. Why I believe this is a regression:

[https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf]:

In this specification Adobe describes the structure of PDF 1.7, the basis for 
the ISO 32000 standard.

Under the *(7) Syntax clause*, there is a *(7.5) File Structure* sub-clause 
that describes the structure of a valid PDF file.

*This excerpt is from sub-sub-clause (7.5.5) File Trailer:*

------
 The _trailer_ of a PDF file enables a conforming reader to quickly find the 
cross-reference table and certain special objects. Conforming readers should 
read a PDF file from its end. +The last line of the file shall contain only the 
end-of-file marker, *%%EOF*.+ The two preceding lines shall contain, one per 
line and in order, the keyword *startxref* and the byte offset in the decoded 
stream from the beginning of the file to the beginning of the *xref* keyword in 
the last cross-reference section.
 ------
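
The requirement quoted above can be turned into a simple pre-check. The following is a minimal sketch, assuming a strict reading of the clause (the last non-empty line must be exactly {{%%EOF}}). The class {{TrailerCheck}} and its method are hypothetical helpers, not PDFBox or Tika API, and the sketch reads the whole file into memory for simplicity.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of PDFBox or Tika.
final class TrailerCheck {

    // Returns true if the last non-empty line of the data is exactly "%%EOF".
    static boolean endsWithEofMarker(byte[] data) {
        int end = data.length;
        // Ignore end-of-line bytes (CR, LF) that follow the marker.
        while (end > 0 && (data[end - 1] == '\n' || data[end - 1] == '\r')) {
            end--;
        }
        // Walk back to the start of the last line.
        int start = end;
        while (start > 0 && data[start - 1] != '\n' && data[start - 1] != '\r') {
            start--;
        }
        String lastLine = new String(data, start, end - start, StandardCharsets.ISO_8859_1);
        return lastLine.equals("%%EOF");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TrailerCheck <file.pdf>");
            return;
        }
        Path file = Paths.get(args[0]);
        byte[] data = Files.readAllBytes(file);
        System.out.println(file + (endsWithEofMarker(data)
                ? ": ends with %%EOF"
                : ": no trailing %%EOF marker"));
    }
}
```

A check like this only detects a missing end-of-file marker. It says nothing about whether the streams inside the file are complete, which is the case the truncated-with-eof.pdf attachment exercises.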

Additionally, the document in question cannot be previewed, because PDF 
viewers consider it broken.
h3. What introduced this change in parsing:

I investigated and tested what introduced this change in behavior.

The change in behavior stems from the resolution of PDFBOX-3798: 
[https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java?r1=1795704&r2=1795703&pathrev=1795704]

I rebuilt both 2.0.7 and 2.0.19 from source after reverting the change 
introduced by the commit above. In both cases this restores the earlier 
behavior: "java.io.IOException: Error reading stream, expected='endstream' 
actual='' at offset ..." is thrown again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

