[ 
https://issues.apache.org/jira/browse/PDFBOX-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950321#comment-17950321
 ] 

Andreas Lehmkühler commented on PDFBOX-5992:
--------------------------------------------

I've changed the parser so that it skips either a line break (CR, LF or CRLF) 
or any one-byte whitespace. It works with the attached file and does not seem 
to break the rendering of any other file, such as those [~tilman] provided earlier.
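
A minimal sketch of the skipping rule described above, for illustration only. It is not the committed PDFBox code: the class name, the {{imageDataStart}} helper and the byte-array interface are assumptions made to keep the example self-contained.

```java
class InlineImageSkipSketch {
    // PDF white-space bytes: NUL, HT, LF, FF, CR, SP.
    static boolean isWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0a
            || b == 0x0c || b == 0x0d || b == 0x20;
    }

    // Hypothetical helper (not PDFBox API): given the offset right after the
    // ID operator, return the offset of the first image data byte.
    static int imageDataStart(byte[] buf, int pos) {
        if (pos >= buf.length) {
            return pos;
        }
        if (buf[pos] == 0x0d) {
            // A CR alone or a CRLF pair counts as one line break.
            boolean crlf = pos + 1 < buf.length && buf[pos + 1] == 0x0a;
            return crlf ? pos + 2 : pos + 1;
        }
        // LF or any other single whitespace byte; otherwise skip nothing.
        return isWhitespace(buf[pos]) ? pos + 1 : pos;
    }

    public static void main(String[] args) {
        byte[] crlf = {'I', 'D', 0x0d, 0x0a, (byte) 0xff, (byte) 0xd8};
        byte[] lfThenData = {'I', 'D', 0x0a, 0x0a, (byte) 0xff};
        System.out.println(imageDataStart(crlf, 2));       // 4
        System.out.println(imageDataStart(lfThenData, 2)); // 3
    }
}
```

The second input shows why the skip is bounded: after a lone LF, a data byte that happens to have a whitespace value is kept as image data.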

> Inline image bug with multi-byte newline tokens
> -----------------------------------------------
>
>                 Key: PDFBOX-5992
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5992
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.33, 3.0.4 PDFBox
>            Reporter: Ben Plate
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: EDGE11896203.pdf, broken-2.pdf, broken.pdf
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> There seems to be an issue with parsing inline image streams that are 
> written with multi-byte whitespace tokens: the image data ends up 
> prepended with whitespace. I've attached an affected PDF to this issue; you 
> should be able to reproduce the problem by loading the PDF and rendering it 
> to an image with {{PDFRenderer}}.
>  
> When converting the PDF to an image, I get the following error:
>  
> {{Exception occurred while converting page at index [0] 
> javax.imageio.IIOException: Not a JPEG stream (starts with: 0x0aff, expected 
> SOI: 0xffd8)}}
>  
> The exception comes from Twelvemonkeys, but there is nothing wrong with the 
> image itself or with that library. The image is zlib-compressed and inserted 
> into the PDF. If you open the PDF, extract the zlib-compressed object, and 
> look at the hex data at the start of the image, you see the 
> following bytes:
>  
> {{4944 0d0a ffd8}}
>  
> This corresponds to the 'ID' token, followed by a newline, followed by the 
> start of the image data, which is a JPEG image as indicated by the SOI 
> marker {{ffd8}}. However, you'll notice that the newline is 
> multi-byte: it's a carriage return followed by a line feed. 
> According to the PDF specification ISO 32000-2:
>  
> ??The PDF character set is divided into three classes referred to as regular, 
> delimiter, and white-space characters. This classification enables the 
> grouping of characters into tokens...PDF treats any sequence of consecutive 
> whitespace characters, not inside of a string or stream, as one character.??
>  
> Furthermore, regarding the section on inline images:
>  
> ??The bytes between the ID operator and a whitespace token, but before the EI 
> operator shall be treated the same as a stream object's data (see 7.3.8, 
> "Stream objects"), even though they do not follow the standard stream 
> syntax.??
>  
> This means that an arbitrary number of whitespace bytes can appear between 
> the 'ID' token and the start of the image stream according to the PDF 
> specification. However, in the {{PDFStreamParser}} class, we observe the 
> following logic for stripping this whitespace:
>  
> {code:java}
> if( isWhitespace() )
> {
>     //pull off the whitespace character
>     source.read();
> }
> {code}
>  
> This handles a single-byte newline, but not a multi-byte one. In the 
> example, the {{0d}} byte is skipped, but the {{0a}} byte is not; it ends up 
> in the {{imageData}} byte array, and an exception is thrown when that array 
> is passed on to Twelvemonkeys.
>  
> The solution here is to simply replace the if statement with a while loop.
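
For illustration, the symptom in the quoted report can be reproduced on the six bytes from its hex dump. This is a sketch only: the class and both helpers are hypothetical stand-ins, not PDFBox or Twelvemonkeys code, and the output format merely imitates the "starts with" part of the quoted error message.

```java
class SingleByteSkipDemo {
    // Mirrors the quoted if-statement: drop at most one whitespace byte.
    static int skipOneWhitespace(byte[] buf, int pos) {
        if (pos >= buf.length) {
            return pos;
        }
        int b = buf[pos];
        boolean ws = b == 0x00 || b == 0x09 || b == 0x0a
                  || b == 0x0c || b == 0x0d || b == 0x20;
        return ws ? pos + 1 : pos;
    }

    // Formats the two bytes at pos as one hex value, e.g. 0xffd8.
    static String firstTwoBytes(byte[] buf, int pos) {
        return String.format("0x%02x%02x", buf[pos] & 0xff, buf[pos + 1] & 0xff);
    }

    public static void main(String[] args) {
        // 'I' 'D' CR LF followed by the JPEG SOI marker.
        byte[] dump = {0x49, 0x44, 0x0d, 0x0a, (byte) 0xff, (byte) 0xd8};
        int start = skipOneWhitespace(dump, 2);
        System.out.println(firstTwoBytes(dump, start)); // 0x0aff
    }
}
```

Only the CR is consumed, so the data handed to the image reader begins with the leftover LF instead of the SOI marker.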



