[ 
https://issues.apache.org/jira/browse/PDFBOX-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946764#comment-17946764
 ] 

Ben Plate commented on PDFBOX-5992:
-----------------------------------

{quote}In the same section you will also find:

_—Unless the image uses *ASCIIHexDecode* or *ASCII85Decode* as -one of its 
filters- its final or only filter, the *ID* operator shall be followed by a 
single white-space character, and the next character shall be interpreted as 
the first byte of image data._

Thus, the _whitespace token_ you found and quoted usually may only consist of a 
single whitespace _byte_.

In general, therefore, blindly applying {{skipWhiteSpaces()}} is wrong!

In case that the image data cannot start with a whitespace, though, skipping 
whitespaces _can_ repair broken PDFs, but _only in that case_!
{quote}
Ah, I missed that. It seems that whatever PDF software generated the PDF I 
encountered doesn't follow this rule, since it writes {{0d0a}} between ID and 
the JPEG SOI.
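
To illustrate the strict rule quoted above, here is a minimal sketch of my own. It is not PDFBox code: the class and method names are hypothetical, and it ignores the ASCIIHexDecode/ASCII85Decode exception. It consumes at most one white-space byte after ID and treats everything after that as image data.

```java
final class InlineImageIdSketch {
    // White-space characters as classified by the PDF specification:
    // NUL, TAB, LF, FF, CR and SPACE.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Hypothetical helper: 'afterId' is the index directly after the "ID"
    // operator. Returns the index of the first image data byte, consuming
    // at most ONE white-space byte, as the quoted rule requires.
    static int dataStart(byte[] content, int afterId) {
        if (afterId < content.length
                && isPdfWhitespace(content[afterId] & 0xFF)) {
            return afterId + 1;
        }
        return afterId;
    }
}
```

With the bytes {{4944 0d0a ffd8}} and {{afterId = 2}}, this returns index 3, so the image data begins with {{0a ff}}, which matches the {{0x0aff}} in the reported exception.
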
{quote}[~bplate],

as an aside, your example PDFs include many KBs of trash bytes after the %%EOF. 
Thus, they are broken as PDFs and PDF processors reading them need to repair 
them somehow. This can result in any kind of accidental change to them.

Also there is something fishy about the page content streams: While PDFBox and 
Chrome can inflate them, iText and Adobe Acrobat cannot. Also the *Length1* 
entry of those streams is weird.

I'd recommend using valid files as test files here.
{quote}
I had to remove confidential information before uploading the PDF, which I did 
by deleting the bytes unrelated to this problem in a hex editor. I left the 
inline image intact but added another whitespace character between the ID 
operator and the start of the image stream. I checked that the file opened on 
my end in Chrome, but PDFBox reported a missing SOI indicator because it 
skipped only two of the whitespace characters. That said, since the PDF 
specification doesn't allow more than a single whitespace character there, the 
point seems moot. The file itself appears to be technically malformed, and the 
check in {{DCTFilter}} accounts for it anyway; that check simply didn't apply 
because of the outdated Twelvemonkeys dependency.
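
For reference, the kind of lenient repair discussed above (skip extra whitespace only when the image data cannot legitimately start with it) could be sketched roughly as follows. This is an assumption-laden illustration, not the actual {{DCTFilter}} check: the class and method names are hypothetical, and it relies only on the fact that a JPEG stream must begin with the SOI marker {{ffd8}}.

```java
final class DctLeadingWhitespaceSketch {
    // White-space characters as classified by the PDF specification.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Hypothetical helper: returns how many leading bytes may be dropped.
    // Leading white-space is skipped only if the SOI marker (ff d8) follows
    // it directly; otherwise the data is left untouched (offset 0).
    static int soiOffset(byte[] data) {
        int i = 0;
        while (i < data.length && isPdfWhitespace(data[i] & 0xFF)) {
            i++;
        }
        boolean soiFollows = i + 1 < data.length
                && (data[i] & 0xFF) == 0xFF
                && (data[i + 1] & 0xFF) == 0xD8;
        return soiFollows ? i : 0;
    }
}
```

The SOI test is what keeps this safe: for filters whose data may start with a whitespace byte, no such marker exists, so blind skipping would corrupt valid images.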

> Inline image bug with multi-byte newline tokens
> -----------------------------------------------
>
>                 Key: PDFBOX-5992
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5992
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.33, 3.0.4 PDFBox
>            Reporter: Ben Plate
>            Priority: Major
>         Attachments: EDGE11896203.pdf, broken-2.pdf, broken.pdf
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There seems to be an issue with parsing inline image streams which are 
> written with multi-byte whitespace tokens causing the image data to be 
> prepended with whitespace. I've attached an affected PDF to this issue where 
> you should be able to reproduce the issue by loading the PDF and using the 
> {{PDFRenderer}} to render the PDF into an image.
>  
> When converting the PDF to an image, I get the following error:
>  
> {{Exception occurred while converting page at index [0] 
> javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected 
> SOI: 0xffd8)}}
>  
> The exception comes from Twelvemonkeys, but there is nothing wrong with the 
> image itself or with that library. The image is zlib-compressed and inserted 
> into the PDF; if you open the PDF, extract the zlib-compressed object, and 
> look at the hex data at the start of the image, you will see the following 
> bytes:
>  
> {{4944 0d0a ffd8}}
>  
> This corresponds to the 'ID' token, followed by a newline, followed by the 
> start of the object stream, which is a JPEG image as indicated by the SOI 
> indicator {{ffd8}}. However, you'll notice that the newline is multi-byte: 
> it's a carriage return followed by a line feed character. 
> According to the PDF specification ISO 32000-2:
>  
> ??The PDF character set is divided into three classes referred to as regular, 
> delimiter, and white-space characters. This classification enables the 
> grouping of characters into tokens...PDF treats any sequence of consecutive 
> whitespace characters, not inside of a string or stream, as one character.??
>  
> Furthermore, regarding the section on inline images:
>  
> ??The bytes between the ID operator and a whitespace token, but before the EI 
> operator shall be treated the same as a stream object's data (see 7.3.8, 
> "Stream objects"), even though they do not follow the standard stream 
> syntax.??
>  
> This means that an arbitrary number of whitespace bytes can appear between 
> the 'ID' token and the start of the image stream according to the PDF 
> specification. However, in the {{PDFStreamParser}} class, we observe the 
> following logic for stripping this whitespace:
>  
> {code:java}
> if( isWhitespace() )
> {
>     //pull off the whitespace character
>     source.read();
> }
> {code}
>  
> This assumes a single-byte newline character and does not properly handle 
> multi-byte newlines. As a result, in the example the {{0d}} character gets 
> skipped but the {{0a}} character does not, so it ends up in the 
> {{imageData}} byte array, and an exception is thrown when that array is 
> passed on to Twelvemonkeys.
>  
> The solution here is to simply replace the if statement with a while loop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
