[ https://issues.apache.org/jira/browse/PDFBOX-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946592#comment-17946592 ]
Tilman Hausherr edited comment on PDFBOX-5992 at 4/23/25 4:13 AM:
------------------------------------------------------------------

[~bplate] Your second broken file also fails to render with Adobe.

[~lehmi] That change destroys the rendering of the "yeung" file from PDFBOX-31, and the file from PDFBOX-3046 (and a few more).


was (Author: tilman):
That change destroys the rendering of the "yeung" file from PDFBOX-31, and the file from PDFBOX-3046 (and a few more).

> Inline image bug with multi-byte newline tokens
> -----------------------------------------------
>
> Key: PDFBOX-5992
> URL: https://issues.apache.org/jira/browse/PDFBOX-5992
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.33, 3.0.4 PDFBox
> Reporter: Ben Plate
> Priority: Major
> Attachments: EDGE11896203.pdf, broken-2.pdf, broken.pdf
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> There seems to be an issue with parsing inline image streams which are
> written with multi-byte whitespace tokens causing the image data to be
> prepended with whitespace. I've attached an affected PDF to this issue where
> you should be able to reproduce the issue by loading the PDF and using the
> {{PDFRenderer}} to render the PDF into an image.
>
> When converting the PDF to an image, I get the following error:
>
> {{Exception occurred while converting page at index [0]
> javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected
> SOI: 0xffd8)}}
>
> The exception comes from Twelvemonkeys but there is nothing wrong with the
> image itself or that library. The image is zlib-compressed and inserted into
> the PDF, but if you open the PDF, extract the zlib-compressed object, and
> look at the hex data of the start of the image, you see included the
> following data:
>
> {{4944 0d0a ffd8}}
>
> This corresponds to the 'ID' token, followed by a newline, followed by the
> start of the object stream, which is a JPEG image as indicated by the SOI
> indicator {{ffd8}}. However, you'll notice that the new line character is
> multi-byte: it's a carriage return followed by a line feed character.
> According to the PDF specification ISO 32000-2:
>
> ??The PDF character set is divided into three classes referred to as regular,
> delimiter, and white-space characters. This classification enables the
> grouping of characters into tokens...PDF treats any sequence of consecutive
> whitespace characters, not inside of a string or stream, as one character.??
>
> Furthermore, regarding the section on inline images:
>
> ??The bytes between the ID operator and a whitespace token, but before the EI
> operator shall be treated the same as a stream object's data (see 7.3.8,
> "Stream objects"), even though they do not follow the standard stream
> syntax.??
>
> This means that an arbitrary number of whitespace bytes can appear between
> the 'ID' token and the start of the image stream according to the PDF
> specification. However, in the {{PDFStreamParser}} class, we observe the
> following logic for stripping this whitespace:
>
> {{if( isWhitespace() )}}
> {
> {{ //pull off the whitespace character}}
> {{ source.read();}}
> {{}}}
>
> This assumes a single-byte newline character, but does not properly handle
> multi-byte newlines. As such, in the example the {{0d}} character gets
> skipped, but the {{0a}} character isn't and is included in the {{imageData}}
> byte array, and when passed on to Twelvemonkeys an exception is thrown.
>
> The solution here is to simply replace the if statement with a while loop.
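
The difference between the single skip and the loop described in the report can be reproduced on the quoted bytes {{4944 0d0a ffd8}} with a small standalone sketch. It does not use PDFBox; the class and method names ({{InlineImageSkipDemo}}, {{skipOne}}, {{skipAll}}, {{isPdfWhitespace}}) are hypothetical and only model the behaviour the reporter describes, not the actual {{PDFStreamParser}} code.

```java
// Standalone illustration only; none of these names exist in PDFBox.
class InlineImageSkipDemo {

    // The six PDF white-space bytes: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0a
            || b == 0x0c || b == 0x0d || b == 0x20;
    }

    // Models the quoted "if": skip at most one white-space byte.
    static int skipOne(byte[] data, int pos) {
        if (pos < data.length && isPdfWhitespace(data[pos] & 0xff)) {
            pos++;
        }
        return pos;
    }

    // Models the proposed "while": skip every consecutive white-space byte.
    static int skipAll(byte[] data, int pos) {
        while (pos < data.length && isPdfWhitespace(data[pos] & 0xff)) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'I' 'D' CR LF followed by the JPEG SOI marker, as in the report.
        byte[] content = {0x49, 0x44, 0x0d, 0x0a, (byte) 0xff, (byte) 0xd8};
        int afterId = 2; // offset just past the ID operator

        int one = skipOne(content, afterId);
        int all = skipAll(content, afterId);

        // Single skip leaves LF as the first "image" byte (0x0a).
        System.out.printf("if:    data starts with 0x%02x%n", content[one] & 0xff);
        // Loop skip lands on the SOI marker (0xff).
        System.out.printf("while: data starts with 0x%02x%n", content[all] & 0xff);
    }
}
```

Note that the loop cannot tell a separator from image data that itself begins with a white-space byte: for {{0a 20 41}} it would skip both {{0a}} and {{20}}. Whether that is what affects the files mentioned in the comment above is not stated in this thread.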