I have spent some time investigating why PDFBox fails to parse the PDF
from https://issues.apache.org/jira/browse/PDFBOX-789.

After a long debugging session I finally found what's wrong with the PDF
(well at least with one part).

The PDF contains multiple inline images. For one of the images, the
image data (the data after the ID token) itself contains the bytes EI,
which is also the end token. The first EI, however, is part of the image
and should not end the image data.

The following snippet shows that there is an EI which is part of the
image data:

BI
/CS/RGB
/W 795
/H 1
/BPC 8
/F/Fl
/DP<</Predictor 15
/Columns 795
/Colors 3>>
ID x<9c>í<95>ÁmÃ0^LEI ...<-- the 'wrong' EI
EI Q <-- the correct EI
q 795 0 0 1 1863 3028.67 cm
BI

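The correct EI is preceded by a newline (0x0A) and followed by a space
(0x20), i.e. <0x0A>EI<0x20>. The wrong EI is preceded by a formfeed
(0x0C) and followed by a newline, i.e. <0x0C>EI<0x0A>. Since formfeed and
newline both count as white space in PDF, simply requiring white space
around EI would not tell the two apart.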

PDFStreamParser already notes that the PDF spec is not really clear on
this:

"PDF spec is kinda unclear about this.  Should a whitespace always
appear before EI?"

Is there something we can do to make this more robust? Should EI always
be followed by a space?
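
To make the question concrete, here is a rough sketch of one possible
heuristic. It is not what PDFStreamParser does today; the class name,
method names and the look-ahead size are all made up for illustration.
The idea is to accept an EI only if it is surrounded by white space AND
the next few bytes look like ordinary content-stream text rather than
binary image data. The sample bytes are invented as well (I assume here
that binary data follows the wrong EI), and a compressed stream could
still defeat this by chance:

```java
final class InlineImageEndFinder {
    // Number of bytes after "EI" to inspect; an arbitrary value for this sketch.
    private static final int LOOKAHEAD = 10;

    // White-space characters in PDF: NUL, HT, LF, FF, CR, SP.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // True if "EI" starts at index i and has white space (or end of data) around it.
    static boolean isDelimitedEi(byte[] d, int i) {
        return i > 0 && i + 1 < d.length
            && d[i] == 'E' && d[i + 1] == 'I'
            && isPdfWhitespace(d[i - 1] & 0xFF)
            && (i + 2 == d.length || isPdfWhitespace(d[i + 2] & 0xFF));
    }

    // True if the bytes following "EI" are printable ASCII, HT, LF or CR.
    static boolean followedByText(byte[] d, int i) {
        int end = Math.min(d.length, i + 2 + LOOKAHEAD);
        for (int k = i + 2; k < end; k++) {
            int b = d[k] & 0xFF;
            boolean textLike = (b >= 0x20 && b <= 0x7E)
                || b == 0x09 || b == 0x0A || b == 0x0D;
            if (!textLike) {
                return false;
            }
        }
        return true;
    }

    // White-space-only rule: returns the index of the first delimited "EI", or -1.
    static int findNaive(byte[] d, int from) {
        for (int i = Math.max(from, 1); i + 1 < d.length; i++) {
            if (isDelimitedEi(d, i)) {
                return i;
            }
        }
        return -1;
    }

    // Stricter rule: the delimited "EI" must also be followed by text-like bytes.
    static int findStrict(byte[] d, int from) {
        for (int i = Math.max(from, 1); i + 1 < d.length; i++) {
            if (isDelimitedEi(d, i) && followedByText(d, i)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Invented sample: <0x0C>EI<0x0A> inside the data, then <0x0A>EI<0x20>Q.
        byte[] d = {
            'x', (byte) 0x9C, (byte) 0xED, 0x0C, 'E', 'I', 0x0A,
            (byte) 0x95, (byte) 0xC1, 0x0A, 'E', 'I', ' ', 'Q', 0x0A, 'q'
        };
        System.out.println("naive:  " + findNaive(d, 0));
        System.out.println("strict: " + findStrict(d, 0));
    }
}
```

On the sample above the white-space-only rule stops at index 4 (the
wrong EI), while the stricter rule skips it and returns index 10.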

Kind regards,

Martijn Brinkers
