[ https://issues.apache.org/jira/browse/PDFBOX-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-5992:
---------------------------------------
    Affects Version/s: 4.0.0

> Inline image bug with multi-byte newline tokens
> -----------------------------------------------
>
>                 Key: PDFBOX-5992
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5992
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.33, 3.0.4 PDFBox, 4.0.0
>            Reporter: Ben Plate
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: EDGE11896203.pdf, broken-2.pdf, broken.pdf
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There seems to be an issue with parsing inline image streams which are
> written with multi-byte whitespace tokens causing the image data to be
> prepended with whitespace. I've attached an affected PDF to this issue where
> you should be able to reproduce the issue by loading the PDF and using the
> {{PDFRenderer}} to render the PDF into an image.
>
> When converting the PDF to an image, I get the following error:
>
> {{Exception occurred while converting page at index [0]
> javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected
> SOI: 0xffd8)}}
>
> The exception comes from Twelvemonkeys but there is nothing wrong with the
> image itself or that library. The image is zlib-compressed and inserted into
> the PDF, but if you open the PDF, extract the zlib-compressed object, and
> look at the hex data of the start of the image, you see included the
> following data:
>
> {{4944 0d0a ffd8}}
>
> This corresponds to the 'ID' token, followed by a newline, followed by the
> start of the object stream, which is a JPEG image as indicated by the SOI
> indicator {{ffd8}}. However, you'll notice that the new line character is
> multi-byte: it's a carriage return followed by a line feed character.
> According to the PDF specification ISO 32000-2:
>
> ??The PDF character set is divided into three classes referred to as regular,
> delimiter, and white-space characters. This classification enables the
> grouping of characters into tokens...PDF treats any sequence of consecutive
> whitespace characters, not inside of a string or stream, as one character.??
>
> Furthermore, regarding the section on inline images:
>
> ??The bytes between the ID operator and a whitespace token, but before the EI
> operator shall be treated the same as a stream object's data (see 7.3.8,
> "Stream objects"), even though they do not follow the standard stream
> syntax.??
>
> This means that an arbitrary number of whitespace bytes can appear between
> the 'ID' token and the start of the image stream according to the PDF
> specification. However, in the {{PDFStreamParser}} class, we observe the
> following logic for stripping this whitespace:
>
> {{if( isWhitespace() )}}
> {
> {{ //pull off the whitespace character}}
> {{ source.read();}}
> {{}}}
>
> This assumes a single-byte newline character, but does not properly handle
> multi-byte newlines. As such, in the example the {{0d}} character gets
> skipped, but the {{0a}} character isn't and is included in the {{imageData}}
> byte array, and when passed on to Twelvemonkeys an exception is thrown.
>
> The solution here is to simply replace the if statement with a while loop.
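
For illustration only, here is a minimal standalone sketch of the difference
between the "if" and the "while" variant described in the issue, run on the
bytes quoted there (4944 0d0a ffd8). It is not the PDFBox code: the class
InlineImageWhitespaceDemo and all of its methods are hypothetical names, and
it works on a plain byte array instead of the parser's source object. It also
assumes the image data itself does not begin with a whitespace byte; a loop
like this would strip such a byte too.

```java
// Hypothetical demo class, not part of PDFBox.
class InlineImageWhitespaceDemo {

    // The six PDF white-space bytes: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // "if" variant: skips at most one whitespace byte after the ID token.
    static int skipSingleWhitespace(byte[] data, int pos) {
        if (pos < data.length && isPdfWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    // "while" variant: skips every consecutive whitespace byte.
    static int skipAllWhitespace(byte[] data, int pos) {
        while (pos < data.length && isPdfWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    // Formats the two bytes at pos the way the IIOException message does.
    static String firstTwoBytesHex(byte[] data, int pos) {
        return String.format("0x%02x%02x", data[pos] & 0xFF, data[pos + 1] & 0xFF);
    }

    public static void main(String[] args) {
        // 'I' 'D' CR LF, then the JPEG SOI marker ff d8.
        byte[] stream = {0x49, 0x44, 0x0D, 0x0A, (byte) 0xFF, (byte) 0xD8};
        int afterId = 2; // index of the first byte after the ID token

        // prints "if:    0x0aff" (the LF leaks into the image data)
        System.out.println("if:    "
            + firstTwoBytesHex(stream, skipSingleWhitespace(stream, afterId)));
        // prints "while: 0xffd8" (the image data starts at the SOI marker)
        System.out.println("while: "
            + firstTwoBytesHex(stream, skipAllWhitespace(stream, afterId)));
    }
}
```

With the single "if", the data handed on starts with 0x0aff, which matches
the value reported in the exception above; with the loop it starts with the
expected SOI marker 0xffd8.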