[
https://issues.apache.org/jira/browse/PDFBOX-4973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler closed PDFBOX-4973.
--------------------------------------
Assignee: Andreas Lehmkühler
Resolution: Not A Problem
> Parsing truncated files no longer throws IOException: Error reading stream,
> expected='endstream' actual='' at offset ...
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-4973
> URL: https://issues.apache.org/jira/browse/PDFBOX-4973
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.7, 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13,
> 2.0.14, 2.0.15, 2.0.16, 2.0.17, 2.0.18, 2.0.19, 2.0.20, 2.0.21
> Environment: Ubuntu 16.04
> Reporter: Plamen Penchev
> Assignee: Andreas Lehmkühler
> Priority: Major
> Attachments: truncated-with-eof.pdf, truncated.pdf
>
>
> h3. Issue:
> Since 2.0.7, parsing a truncated PDF file whose stream is cut off no longer
> throws an exception.
> In 2.0.6, *COSParser's parseCOSStream* throws *"java.io.IOException: Error
> reading stream, expected='endstream' actual='' at offset ..."*, whereas in
> 2.0.7 and later the parsing succeeds. If an EOF marker is appended to the
> truncated file, however, the expected exception is thrown once again.
> The code below is the minimum setup for reproducing the behavior (_in
> conjunction with the respective file attached_):
> {code:java}
> import org.apache.tika.exception.TikaException;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.pdf.PDFParserConfig;
> import org.apache.tika.sax.BodyContentHandler;
> import org.xml.sax.SAXException;
> import java.io.File;
> import java.io.IOException;
> public class Main {
>     public static void main(String[] args) {
>         File inputFile = new File("/path/to/parent/folder", "truncated.pdf");
>         try {
>             // metadata will be extracted by Tika
>             Metadata meta = new Metadata();
>             meta.set(Metadata.CONTENT_TYPE, "application/pdf");
>             BodyContentHandler ch = new BodyContentHandler(-1);
>             AutoDetectParser parser = new AutoDetectParser();
>             PDFParserConfig pdfParserConfig = new PDFParserConfig();
>             pdfParserConfig.setOcrStrategy("no_ocr");
>             pdfParserConfig.setMaxMainMemoryBytes(209715200);
>             ParseContext parseContext = new ParseContext();
>             parseContext.set(PDFParserConfig.class, pdfParserConfig);
>             try (TikaInputStream is = TikaInputStream.get(inputFile.toPath())) {
>                 // try to parse the document
>                 parser.parse(is, ch, meta, parseContext);
>             }
>         } catch (TikaException | SAXException | IOException ex) {
>             // expect to enter catch
>         } finally {
>             // instead catch is skipped
>         }
>     }
> }
> {code}
> The stack looks like this:
> ||parseCOSStream||COSParser||(pdfbox)||
> ||parseFileObject||COSParser||(pdfbox)||
> ||parseObjectDynamically||COSParser||(pdfbox)||
> ||parseDictObjects||COSParser||(pdfbox)||
> ||initialParse||PDFParser||(pdfbox)||
> ||parse||PDFParser||(pdfbox)||
> ||load||PDDocument||(pdfbox)||
> ||parse||PDFParser||(tika-parsers)||
> ||parse||CompositeParser||(tika-parsers)||
> In 2.0.6, the IOException thrown in parseCOSStream is caught in Tika's
> CompositeParser parse method and rethrown as a TikaException, which we
> expect internally and handle as in the sample code provided.
> h3. Why I believe this is a regression:
> [https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf]:
> In this specification, Adobe describes the structure of PDF 1.7, the basis for
> the ISO 32000 standard.
> Under the *(7) Syntax* clause, there is a *(7.5) File Structure* subclause
> that describes the valid PDF file structure.
> *This excerpt is from subclause (7.5.5) File Trailer:*
> ------
> The _trailer_ of a PDF file enables a conforming reader to quickly find the
> cross-reference table and certain special objects. Conforming readers should
> read a PDF file from its end. +The last line of the file shall contain only
> the end-of-file marker, *%%EOF*.+ The two preceding lines shall contain, one
> per line and in order, the keyword *startxref* and the byte offset in the
> decoded stream from the beginning of the file to the beginning of the *xref*
> keyword in the last cross-reference section.
> ------
> Additionally, the document in question cannot be previewed, as PDF
> previewers consider it broken.
> h3. What introduced this change in parsing:
> I investigated and tested what introduced this change in behavior.
> It stems from the resolution of PDFBOX-3798:
> [https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java?r1=1795704&r2=1795703&pathrev=1795704]
> I have tested rebuilding both 2.0.7 and 2.0.19 from their source code after
> reverting the change introduced by the commit above. This brings the behavior
> back to throwing "java.io.IOException: Error reading stream,
> expected='endstream' actual='' at offset ..." again.
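
The trailer rule quoted from subclause 7.5.5 in the description above can be checked without any PDF library. The following is a minimal sketch, not part of the original report and not PDFBox or Tika code: the class name TrailerCheck, the method hasValidTrailer, and the 1024-byte tail window are assumptions made for illustration. It only tests whether the last three non-empty lines are startxref, a decimal offset, and %%EOF, and it says nothing about how PDFBox itself treats such a file.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox or Tika.
public class TrailerCheck {

    /**
     * Returns true if the data ends with the three lines described in
     * ISO 32000-1, 7.5.5: "startxref", a decimal byte offset, and "%%EOF".
     */
    public static boolean hasValidTrailer(byte[] data) {
        // Only the end of the file matters; 1024 bytes is an arbitrary window.
        int start = Math.max(0, data.length - 1024);
        String tail = new String(data, start, data.length - start,
                StandardCharsets.ISO_8859_1);

        // PDF accepts CR, LF, or CRLF as end-of-line markers.
        // Surrounding whitespace and blank lines are ignored, which is
        // more lenient than the wording of the specification.
        List<String> lines = new ArrayList<>();
        for (String line : tail.split("\r\n|\r|\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }

        int n = lines.size();
        if (n < 3) {
            return false;
        }
        return lines.get(n - 1).equals("%%EOF")
                && lines.get(n - 2).matches("\\d+")
                && lines.get(n - 3).equals("startxref");
    }

    public static void main(String[] args) throws java.io.IOException {
        // Usage: java TrailerCheck /path/to/file.pdf
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasValidTrailer(data)
                ? "trailer present"
                : "trailer missing or malformed");
    }
}
```

A caller that needs to reject truncated input regardless of the parser version could run a check like this before handing the file to the parser. It does not verify that the offset actually points at an xref section.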