Ben Plate created PDFBOX-5992:
---------------------------------
Summary: Inline image bug with multi-byte newline tokens
Key: PDFBOX-5992
URL: https://issues.apache.org/jira/browse/PDFBOX-5992
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 3.0.4 PDFBox, 2.0.33
Reporter: Ben Plate
Attachments: broken.pdf
There seems to be an issue with parsing inline image streams in which the
{{ID}} operator is followed by more than one whitespace byte: the extra
whitespace ends up prepended to the image data. I've attached an affected PDF
to this issue; you should be able to reproduce the problem by loading the PDF
and rendering it to an image with {{PDFRenderer}}.
When converting the PDF to an image, I get the following error:
{{Exception occurred while converting page at index [0]
javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected
SOI: 0xffd8)}}
The exception comes from Twelvemonkeys but there is nothing wrong with the
image itself or that library. The image is zlib-compressed and inserted into
the PDF, but if you open the PDF, extract the zlib-compressed object, and look
at the hex data at the start of the image, you see the following bytes:
{{4944 0d0a ffd8}}
This corresponds to the 'ID' token, followed by a newline, followed by the
start of the image data, which is a JPEG image as indicated by the SOI
marker {{ffd8}}. However, you'll notice that the newline is two bytes long:
a carriage return ({{0d}}) followed by a line feed ({{0a}}). According to the
PDF specification ISO 32000-2:
??The PDF character set is divided into three classes referred to as regular,
delimiter, and white-space characters. This classification enables the grouping
of characters into tokens...PDF treats any sequence of consecutive whitespace
characters, not inside of a string or stream, as one character.??
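To make the byte classes concrete, here is a minimal sketch that labels each byte of the hex dump above as whitespace or not, using the six white-space bytes defined by the PDF specification (NUL, TAB, LF, FF, CR, SPACE). The class and method names are hypothetical and are not part of PDFBox.

```java
// Hypothetical helper for illustration only; not PDFBox code.
public class PdfWhitespaceDemo {

    // The six PDF white-space bytes: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Labels every byte as "ws" (white-space) or "other".
    static String describe(int[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int b : bytes) {
            sb.append(String.format("%02x=%s ", b, isPdfWhitespace(b) ? "ws" : "other"));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The bytes from the hex dump: 'I' 'D' CR LF, then the JPEG SOI marker.
        int[] dump = {0x49, 0x44, 0x0d, 0x0a, 0xff, 0xd8};
        System.out.println(describe(dump));
        // prints: 49=other 44=other 0d=ws 0a=ws ff=other d8=other
    }
}
```

Both {{0d}} and {{0a}} fall into the white-space class, so two whitespace bytes sit between the 'ID' token and the SOI marker in the attached file.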
Furthermore, regarding the section on inline images:
??The bytes between the ID operator and a whitespace token, but before the EI
operator shall be treated the same as a stream object's data (see 7.3.8,
"Stream objects"), even though they do not follow the standard stream syntax.??
This means that an arbitrary number of whitespace bytes can appear between the
'ID' token and the start of the image stream according to the PDF
specification. However, in the {{PDFStreamParser}} class, we observe the
following logic for stripping this whitespace:
{{if( isWhitespace() )}}
{{{}}
{{ //pull off the whitespace character}}
{{ source.read();}}
{{}}}
This handles a single whitespace byte, but not a newline made up of two bytes.
As such, in the example the {{0d}} byte gets skipped but the {{0a}} byte does
not; it ends up in the {{imageData}} byte array, and an exception is thrown
when that array is passed on to Twelvemonkeys.
The proposed solution is to simply replace the if statement with a while loop.
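The difference between the two approaches can be sketched on the bytes that follow 'ID' in the attached file. This is a standalone illustration working on a plain byte array, not the actual {{PDFStreamParser}} code; the class and method names are hypothetical.

```java
// Standalone illustration; names are hypothetical and not part of PDFBox.
public class InlineImageSkipDemo {

    // The six PDF white-space bytes: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Mirrors the "if" logic: drops at most one whitespace byte.
    static int skipSingle(byte[] data, int pos) {
        if (pos < data.length && isWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    // Mirrors the proposed "while" logic: drops every leading whitespace byte.
    static int skipAll(byte[] data, int pos) {
        while (pos < data.length && isWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Bytes following the 'ID' token: CR LF, then the JPEG SOI marker.
        byte[] afterId = {0x0d, 0x0a, (byte) 0xff, (byte) 0xd8};

        int ifPos = skipSingle(afterId, 0);
        int whilePos = skipAll(afterId, 0);

        System.out.printf("if:    image starts with 0x%02x%n", afterId[ifPos] & 0xFF);
        // prints: if:    image starts with 0x0a
        System.out.printf("while: image starts with 0x%02x%n", afterId[whilePos] & 0xFF);
        // prints: while: image starts with 0xff
    }
}
```

With the single skip, the image data begins at {{0a}}, which matches the {{0x0aff}} in the exception message; with the loop it begins at the SOI marker. Note that a loop like this would also drop a leading image byte that happens to have a whitespace value, so a real patch may need to account for that.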