[jira] [Comment Edited] (PDFBOX-5992) Inline image bug with multi-byte newline tokens

Michael Klink (Jira) Wed, 23 Apr 2025 00:47:18 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946645#comment-17946645
 ]


Michael Klink edited comment on PDFBOX-5992 at 4/23/25 7:46 AM:
----------------------------------------------------------------

Considering the starting assumptions:

{quote}_—The PDF character set is divided into three classes referred to as 
regular, delimiter, and white-space characters. This classification enables the 
grouping of characters into tokens...PDF treats any sequence of consecutive 
whitespace characters, not inside of a string or stream, as one character._
 
Furthermore, regarding the section on inline images:
 
_—The bytes between the ID operator and a whitespace token, but before the EI 
operator shall be treated the same as a stream object's data (see 7.3.8, 
"Stream objects"), even though they do not follow the standard stream syntax._
 
This means that an arbitrary number of whitespace bytes can appear between the 
'ID' token and the start of the image stream according to the PDF 
specification.{quote}

In the same section you will also find:

_—Unless the image uses *ASCIIHexDecode* or *ASCII85Decode* as -one of its 
filters- its final or only filter, the *ID* operator shall be followed by a 
single white-space character, and the next character shall be interpreted as 
the first byte of image data._

Thus, the _whitespace token_ you found and quoted may usually consist of only a 
single whitespace _byte_.
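
The rule quoted above can be sketched as follows. This is only an illustration, not PDFBox code: the class {{StrictIdWhitespace}}, the method {{dataStart}} and the flag {{asciiFilterCondition}} are hypothetical names, and deciding from the image's filters whether the quoted ASCII-filter condition holds is left to the caller. Skipping all whitespace in the ASCII case is an assumption, based on both ASCII filters ignoring whitespace in their input.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class StrictIdWhitespace {

    /** The six PDF white-space bytes: NUL, TAB, LF, FF, CR and SPACE. */
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Returns the index of the first image data byte.
     *
     * @param content              bytes of the content stream
     * @param afterId              index of the byte directly after the ID operator
     * @param asciiFilterCondition true if the quoted exception for
     *                             ASCIIHexDecode / ASCII85Decode applies
     */
    static int dataStart(byte[] content, int afterId, boolean asciiFilterCondition) {
        int pos = afterId;
        if (asciiFilterCondition) {
            // Assumption: extra whitespace is harmless here because both
            // ASCII filters ignore whitespace in their input.
            while (pos < content.length && isPdfWhitespace(content[pos] & 0xFF)) {
                pos++;
            }
            return pos;
        }
        // Otherwise at most one whitespace byte separates ID from the data;
        // everything after it, whitespace or not, is image data.
        if (pos < content.length && isPdfWhitespace(content[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }
}
```

For the bytes {{49 44 0d 0a ff d8}} from the issue description, with {{afterId = 2}} and no ASCII filter, {{dataStart}} returns 3, so under a strict reading the {{0a}} already counts as image data.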

In general, therefore, blindly applying {{skipWhiteSpaces()}} is wrong! 

If the image data cannot start with a whitespace byte, though, skipping 
whitespace _can_ repair broken PDFs, but _only in that case_!
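
A minimal sketch of such a conditional repair, assuming the caller already knows whether the image data can start with a whitespace byte (for example, JPEG data has to start with the SOI marker {{ffd8}}, so it cannot). The class {{LenientIdWhitespace}}, the method {{dataStart}} and the flag {{dataCannotStartWithWhitespace}} are hypothetical names and not part of PDFBox.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class LenientIdWhitespace {

    /** The six PDF white-space bytes: NUL, TAB, LF, FF, CR and SPACE. */
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Returns the index of the first image data byte after the ID operator.
     * Additional whitespace is only skipped when the caller asserts that
     * valid image data can never begin with a whitespace byte.
     */
    static int dataStart(byte[] content, int afterId,
                         boolean dataCannotStartWithWhitespace) {
        int pos = afterId;
        // The single separator byte that belongs to the ID operator.
        if (pos < content.length && isPdfWhitespace(content[pos] & 0xFF)) {
            pos++;
        }
        // Repair step: safe only because no valid data byte is dropped.
        if (dataCannotStartWithWhitespace) {
            while (pos < content.length && isPdfWhitespace(content[pos] & 0xFF)) {
                pos++;
            }
        }
        return pos;
    }
}
```

With the flag set, the sequence {{49 44 0d 0a ff d8}} yields index 4 (the {{ff}}); without it, the result is index 3 and the second whitespace byte is kept as data, which is what has to happen for image data whose first byte happens to be a whitespace value.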



> Inline image bug with multi-byte newline tokens
> -----------------------------------------------
>
>                 Key: PDFBOX-5992
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5992
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.33, 3.0.4 PDFBox
>            Reporter: Ben Plate
>            Priority: Major
>         Attachments: EDGE11896203.pdf, broken-2.pdf, broken.pdf
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There seems to be an issue with parsing inline image streams which are 
> written with multi-byte whitespace tokens causing the image data to be 
> prepended with whitespace. I've attached an affected PDF to this issue where 
> you should be able to reproduce the issue by loading the PDF and using the 
> {{PDFRenderer}} to render the PDF into an image.
>  
> When converting the PDF to an image, I get the following error:
>  
> {{Exception occurred while converting page at index [0] 
> javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected 
> SOI: 0xffd8)}}
>  
> The exception comes from Twelvemonkeys, but there is nothing wrong with the 
> image itself or with that library. The image is zlib-compressed and inserted 
> into the PDF, but if you open the PDF, extract the zlib-compressed object, 
> and look at the hex data at the start of the image, you see the following 
> bytes:
>  
> {{4944 0d0a ffd8}}
>  
> This corresponds to the 'ID' token, followed by a newline, followed by the 
> start of the image data, which is a JPEG image as indicated by the SOI 
> marker {{ffd8}}. However, you'll notice that the newline is 
> multi-byte: it's a carriage return followed by a line feed character. 
> According to the PDF specification ISO 32000-2:
>  
> ??The PDF character set is divided into three classes referred to as regular, 
> delimiter, and white-space characters. This classification enables the 
> grouping of characters into tokens...PDF treats any sequence of consecutive 
> whitespace characters, not inside of a string or stream, as one character.??
>  
> Furthermore, regarding the section on inline images:
>  
> ??The bytes between the ID operator and a whitespace token, but before the EI 
> operator shall be treated the same as a stream object's data (see 7.3.8, 
> "Stream objects"), even though they do not follow the standard stream 
> syntax.??
>  
> This means that an arbitrary number of whitespace bytes can appear between 
> the 'ID' token and the start of the image stream according to the PDF 
> specification. However, in the {{PDFStreamParser}} class, we observe the 
> following logic for stripping this whitespace:
>  
> {{if( isWhitespace() )}}
> {
> {{    //pull off the whitespace character}}
> {{    source.read();}}
> {{}}}
>  
> This assumes a single-byte newline character, but does not properly handle 
> multi-byte newlines. As such, in the example the {{0d}} byte gets 
> skipped, but the {{0a}} byte is not; it ends up in the {{imageData}} 
> byte array, and an exception is thrown when that array is passed on to Twelvemonkeys.
>  
> The solution here is to simply replace the if statement with a while loop.
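
The byte sequence from the issue description can be replayed in isolation. The sketch below does not use PDFBox; the class {{IdWhitespaceReplay}} and its methods are hypothetical stand-ins that mimic the quoted {{if}} statement and the proposed {{while}} loop on a plain byte array.

```java
/** Hypothetical replay of the reported bytes; not part of PDFBox. */
final class IdWhitespaceReplay {

    /** Bytes quoted in the issue: 'I', 'D', CR, LF, then the JPEG SOI marker. */
    static final byte[] SAMPLE = {0x49, 0x44, 0x0D, 0x0A, (byte) 0xFF, (byte) 0xD8};

    /** The six PDF white-space bytes: NUL, TAB, LF, FF, CR and SPACE. */
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /** Mimics the quoted if statement: drops at most one whitespace byte. */
    static int skipOne(byte[] data, int pos) {
        if (pos < data.length && isPdfWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    /** Mimics the proposed while loop: drops every leading whitespace byte. */
    static int skipAll(byte[] data, int pos) {
        while (pos < data.length && isPdfWhitespace(data[pos] & 0xFF)) {
            pos++;
        }
        return pos;
    }

    /** Formats the two bytes at pos the way the exception message does. */
    static String firstTwoBytes(byte[] data, int pos) {
        return String.format("0x%02x%02x", data[pos] & 0xFF, data[pos + 1] & 0xFF);
    }
}
```

Starting at index 2 (directly after 'ID'), {{firstTwoBytes(SAMPLE, skipOne(SAMPLE, 2))}} gives {{0x0aff}}, matching the "starts with" value in the exception, while {{skipAll}} leaves the data starting at {{0xffd8}}. Note that {{skipAll}} would also drop a leading {{20}} or {{0a}} that genuinely belongs to the image data, so this replay says nothing about whether the loop is safe in general.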



