[
https://issues.apache.org/jira/browse/PDFBOX-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946592#comment-17946592
]
Tilman Hausherr edited comment on PDFBOX-5992 at 4/23/25 4:13 AM:
------------------------------------------------------------------
[~bplate] Your second broken file also fails to render with Adobe.
[~lehmi] That change destroys the rendering of the "yeung" file from PDFBOX-31,
and the file from PDFBOX-3046 (and a few more).
was (Author: tilman):
That change destroys the rendering of the "yeung" file from PDFBOX-31, and the
file from PDFBOX-3046 (and a few more).
> Inline image bug with multi-byte newline tokens
> -----------------------------------------------
>
> Key: PDFBOX-5992
> URL: https://issues.apache.org/jira/browse/PDFBOX-5992
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.33, 3.0.4 PDFBox
> Reporter: Ben Plate
> Priority: Major
> Attachments: EDGE11896203.pdf, broken-2.pdf, broken.pdf
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> There seems to be an issue with parsing inline image streams that are
> written with multi-byte whitespace tokens: the image data ends up
> prepended with whitespace. I've attached an affected PDF to this issue; you
> should be able to reproduce the problem by loading the PDF and using the
> {{PDFRenderer}} to render it into an image.
>
> When converting the PDF to an image, I get the following error:
>
> {{Exception occurred while converting page at index [0]
> javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected
> SOI: 0xffd8)}}
>
> The exception comes from Twelvemonkeys but there is nothing wrong with the
> image itself or that library. The image is zlib-compressed and inserted into
> the PDF, but if you open the PDF, extract the zlib-compressed object, and
> look at the hex data at the start of the image, you see the following
> bytes:
>
> {{4944 0d0a ffd8}}
>
> This corresponds to the 'ID' token, followed by a newline, followed by the
> start of the object stream, which is a JPEG image as indicated by the SOI
> indicator {{ffd8}}. However, you'll notice that the newline character is
> multi-byte: it's a carriage return followed by a line feed character.
> According to the PDF specification ISO 32000-2:
>
> ??The PDF character set is divided into three classes referred to as regular,
> delimiter, and white-space characters. This classification enables the
> grouping of characters into tokens...PDF treats any sequence of consecutive
> whitespace characters, not inside of a string or stream, as one character.??
>
> Furthermore, regarding the section on inline images:
>
> ??The bytes between the ID operator and a whitespace token, but before the EI
> operator shall be treated the same as a stream object's data (see 7.3.8,
> "Stream objects"), even though they do not follow the standard stream
> syntax.??
>
> This means that an arbitrary number of whitespace bytes can appear between
> the 'ID' token and the start of the image stream according to the PDF
> specification. However, in the {{PDFStreamParser}} class, we observe the
> following logic for stripping this whitespace:
>
> if( isWhitespace() )
> {
>     //pull off the whitespace character
>     source.read();
> }
>
> This assumes a single-byte newline and does not handle multi-byte newlines
> properly. As a result, in the example the {{0d}} byte is skipped but the
> {{0a}} byte is not: it ends up in the {{imageData}} byte array, and an
> exception is thrown when the array is passed on to Twelvemonkeys.
>
> The solution here is to simply replace the if statement with a while loop.
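
The difference between the two approaches described above can be sketched in isolation. This is a minimal, standalone illustration and not the actual {{PDFStreamParser}} code: the class name and the helpers {{isPdfWhitespace}}, {{skipSingle}} and {{skipAll}} are hypothetical, and the input is just the six bytes quoted in the report ({{4944 0d0a ffd8}}).

```java
// Hypothetical demo class; none of these names exist in PDFBox.
class InlineImageWhitespaceDemo {

    // White-space bytes defined by the PDF specification:
    // NUL, TAB, LF, FF, CR and SPACE.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Mirrors the quoted "if" logic: skip at most one white-space byte.
    static int skipSingle(byte[] data, int pos) {
        if (pos < data.length && isPdfWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    // Mirrors the proposed "while" logic: skip every consecutive white-space byte.
    static int skipAll(byte[] data, int pos) {
        while (pos < data.length && isPdfWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'I' 'D' CR LF followed by the JPEG SOI marker, as in the report.
        byte[] data = {0x49, 0x44, 0x0D, 0x0A, (byte) 0xFF, (byte) 0xD8};
        int afterId = 2; // offset just past the ID operator

        int single = skipSingle(data, afterId);
        int all = skipAll(data, afterId);

        // Prints: if:    image starts at offset 3 with 0x0a
        System.out.printf("if:    image starts at offset %d with 0x%02x%n",
                single, data[single] & 0xFF);
        // Prints: while: image starts at offset 4 with 0xff
        System.out.printf("while: image starts at offset %d with 0x%02x%n",
                all, data[all] & 0xFF);
    }
}
```

With the single skip, the image data begins with {{0a ff}}, which matches the {{0x0aff}} in the reported exception. Note that the comment at the top of this message reports that the loop variant breaks the rendering of other files. A plausible explanation, which is an assumption and not confirmed in this thread, is that binary image data can itself start with a byte from the white-space set, which a greedy loop would also consume.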