[ https://issues.apache.org/jira/browse/PDFBOX-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946645#comment-17946645 ]
Michael Klink edited comment on PDFBOX-5992 at 4/23/25 7:46 AM:
----------------------------------------------------------------

Considering the starting assumptions:
{quote}_—The PDF character set is divided into three classes referred to as regular, delimiter, and white-space characters. This classification enables the grouping of characters into tokens...PDF treats any sequence of consecutive whitespace characters, not inside of a string or stream, as one character._

Furthermore, regarding the section on inline images:

_—The bytes between the ID operator and a whitespace token, but before the EI operator shall be treated the same as a stream object's data (see 7.3.8, "Stream objects"), even though they do not follow the standard stream syntax._

This means that an arbitrary number of whitespace bytes can appear between the 'ID' token and the start of the image stream according to the PDF specification.{quote}
In the same section you will also find:

_—Unless the image uses *ASCIIHexDecode* or *ASCII85Decode* as -one of its filters- its final or only filter, the *ID* operator shall be followed by a single white-space character, and the next character shall be interpreted as the first byte of image data._

Thus, the _whitespace token_ you found and quoted usually may only consist of a single whitespace _byte_.

In general, therefore, blindly applying {{skipWhiteSpaces()}} is wrong! In case that the image data cannot start with a whitespace, though, skipping whitespaces _can_ repair broken PDFs, but _only in that case_!


was (Author: mkl):
    Considering the starting assumptions:
{quote}_—The PDF character set is divided into three classes referred to as regular, delimiter, and white-space characters. This classification enables the grouping of characters into tokens...PDF treats any sequence of consecutive whitespace characters, not inside of a string or stream, as one character._

Furthermore, regarding the section on inline images:

_—The bytes between the ID operator and a whitespace token, but before the EI operator shall be treated the same as a stream object's data (see 7.3.8, "Stream objects"), even though they do not follow the standard stream syntax._

This means that an arbitrary number of whitespace bytes can appear between the 'ID' token and the start of the image stream according to the PDF specification.{quote}
In the same section you will also find:

_—Unless the image uses *ASCIIHexDecode* or *ASCII85Decode* as -one of its filters- its final or only filter, the *ID* operator shall be followed by a single white-space character, and the next character shall be interpreted as the first byte of image data._

Thus, the _whitespace token_ you found and quoted usually may only consist of a single whitespace _byte_.

> Inline image bug with multi-byte newline tokens
> -----------------------------------------------
>
>                 Key: PDFBOX-5992
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5992
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.33, 3.0.4 PDFBox
>            Reporter: Ben Plate
>            Priority: Major
>         Attachments: EDGE11896203.pdf, broken-2.pdf, broken.pdf
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There seems to be an issue with parsing inline image streams which are
> written with multi-byte whitespace tokens causing the image data to be
> prepended with whitespace. I've attached an affected PDF to this issue where
> you should be able to reproduce the issue by loading the PDF and using the
> {{PDFRenderer}} to render the PDF into an image.
>
> When converting the PDF to an image, I get the following error:
>
> {{Exception occurred while converting page at index [0]
> javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected
> SOI: 0xffd8)}}
>
> The exception comes from Twelvemonkeys but there is nothing wrong with the
> image itself or that library. The image is zlib-compressed and inserted into
> the PDF, but if you open the PDF, extract the zlib-compressed object, and
> look at the hex data of the start of the image, you see included the
> following data:
>
> {{4944 0d0a ffd8}}
>
> This corresponds to the 'ID' token, followed by a newline, followed by the
> start of the object stream, which is a JPEG image as indicated by the SOI
> indicator {{ffd8}}. However, you'll notice that the new line character is
> multi-byte: it's a carriage return followed by a line feed character.
> According to the PDF specification ISO 32000-2:
>
> ??The PDF character set is divided into three classes referred to as regular,
> delimiter, and white-space characters. This classification enables the
> grouping of characters into tokens...PDF treats any sequence of consecutive
> whitespace characters, not inside of a string or stream, as one character.??
>
> Furthermore, regarding the section on inline images:
>
> ??The bytes between the ID operator and a whitespace token, but before the EI
> operator shall be treated the same as a stream object's data (see 7.3.8,
> "Stream objects"), even though they do not follow the standard stream
> syntax.??
>
> This means that an arbitrary number of whitespace bytes can appear between
> the 'ID' token and the start of the image stream according to the PDF
> specification. However, in the {{PDFStreamParser}} class, we observe the
> following logic for stripping this whitespace:
>
> {{if( isWhitespace() )}}
> {
> {{    //pull off the whitespace character}}
> {{    source.read();}}
> {{}}}
>
> This assumes a single-byte newline character, but does not properly handle
> multi-byte newlines. As such, in the example the {{0d}} character gets
> skipped, but the {{0a}} character isn't and is included in the {{imageData}}
> byte array, and when passed on to Twelvemonkeys an exception is thrown.
>
> The solution here is to simply replace the if statement with a while loop.
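
The difference between the proposed while loop and the single-byte rule quoted in the comment at the top of this message may be easier to see in a small sketch. This is not PDFBox code: the class {{InlineImageStart}}, the method {{imageDataStart}} and the flag {{dataCannotStartWithWhitespace}} are hypothetical names, and the sketch assumes the caller already knows (for example from the image's filter) whether the image data can legitimately begin with a white-space byte.

```java
public class InlineImageStart {

    // White-space characters of the PDF character set:
    // NUL, TAB, LF, FF, CR and SPACE.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
                || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Returns the index of the first byte of image data.
     *
     * @param content the content stream bytes
     * @param afterId index of the byte directly after the "ID" operator
     * @param dataCannotStartWithWhitespace hypothetical flag: true only if the
     *        caller knows the image data cannot begin with a white-space byte
     */
    static int imageDataStart(byte[] content, int afterId,
                              boolean dataCannotStartWithWhitespace) {
        int pos = afterId;
        // Rule from the specification: a single white-space byte follows ID.
        if (pos < content.length && isPdfWhitespace(content[pos] & 0xFF)) {
            pos++;
        }
        // Repair mode: skip further white-space only when that cannot
        // swallow genuine image bytes.
        if (dataCannotStartWithWhitespace) {
            while (pos < content.length && isPdfWhitespace(content[pos] & 0xFF)) {
                pos++;
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        // The bytes from the report: "ID", CR, LF, then the JPEG SOI marker.
        byte[] crlfJpeg = {'I', 'D', 0x0D, 0x0A, (byte) 0xFF, (byte) 0xD8};
        System.out.println(imageDataStart(crlfJpeg, 2, false)); // prints 3
        System.out.println(imageDataStart(crlfJpeg, 2, true));  // prints 4
    }
}
```

With the flag unset, the CR LF example yields index 3, so the {{0a}} byte stays in the image data as the report describes. With the flag set, it yields index 4, the position of {{ffd8}}. For raw image data whose first byte happens to be {{20}} or {{00}}, an unconditional loop would drop real image bytes, which is why the comment restricts the repair to that one case.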