I see what you mean about the spec being vague.  In the examples in the 
PDF, the "EI" is always on its own line (implying a newline).  Given this, 
and the data you found, I'd like to propose that we look for "<0x0A>EI" 
followed by whitespace.  I'm choosing whitespace simply because that's what 
the PDF spec says should come after ID in section 8.9.7; although it 
doesn't explicitly say what should follow EI, it'd make sense to be 
consistent with the ID operator.  Whitespace is defined in the PDF spec 
in Table 1 (Section 7.2.2, Character Set) as 0x00, 0x09, 0x0A, 0x0C, 0x0D, 
or 0x20.

It seems like this would take care of the PDF in question and be a 
reasonable way to interpret the spec.
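
To make the proposal concrete, here is a rough Java sketch of the check I 
have in mind.  The class and method names are made up for illustration and 
are not existing PDFBox code.  It assumes the content stream is available 
as a byte array, and it treats an EI at the very end of the data (with no 
byte after it) as not matching, which is my own assumption.

```java
import java.nio.charset.StandardCharsets;

public class InlineImageEndFinder {

    // White-space characters from Table 1 (Section 7.2.2) of ISO 32000-1:2008.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A
            || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Returns the index of the 'E' in the first <0x0A>EI<whitespace>
     * sequence at or after 'start', or -1 if there is none.
     */
    static int findEndOfInlineImage(byte[] data, int start) {
        // Begin at index 1 at the earliest so data[i - 1] is always valid.
        for (int i = Math.max(start, 1); i + 2 < data.length; i++) {
            if (data[i - 1] == 0x0A
                    && data[i] == 'E'
                    && data[i + 1] == 'I'
                    && isPdfWhitespace(data[i + 2] & 0xFF)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the reported image data:
        // <0x0C>EI<0x0A> inside the data, then the real <0x0A>EI<0x20>.
        byte[] data = "x\fEI\nyy\nEI Q".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findEndOfInlineImage(data, 0)); // prints 8
    }
}
```

On that stand-in data the first EI is skipped because it is preceded by 
0x0C rather than 0x0A, and the search stops at the second EI.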

I'm still a little concerned about a <0x0A>EI<0x20> appearing within the 
image data, though.  It looks like the spec simply doesn't allow for 
that byte sequence to be in an inline image, so I'm not sure how much 
better we can do...

For reference, I'm using ISO32000-1:2008 as "the PDF spec".

---- 
Thanks,
Adam



From:
"martijn.list" <[email protected]>
To:
[email protected]
Date:
11/24/2010 12:51
Subject:
Image data sometimes contains EI. This however should not end the image 
data.



I have spent some time investigating why PDFBox fails to parse the PDF
from  https://issues.apache.org/jira/browse/PDFBOX-789.

After a long debugging session I finally found what's wrong with the PDF
(well at least with one part).

The PDF contains multiple inline images. With one of the images it
appears that the image data (the data after the ID token) contains EI
which is an end token. The first EI however, is part of the image and
should not end the image data.

The following snippet shows that there is an EI which is part of the
image data:

BI
/CS/RGB
/W 795
/H 1
/BPC 8
/F/Fl
/DP<</Predictor 15
/Columns 795
/Colors 3>>
ID x<9c>í<95>ÁmÃ0^LEI ...<-- the 'wrong' EI
EI Q <-- the correct EI
q 795 0 0 1 1863 3028.67 cm
BI

The correct EI is preceded by a newline (0x0A) and followed by a space
(0x20), i.e. <0x0A>EI<0x20>. The wrong EI is preceded by a formfeed (0x0C)
and followed by a newline, i.e. <0x0C>EI<0x0A>.

PDFStreamParser already notes that the PDF spec is not really clear:

"PDF spec is kinda unclear about this.  Should a whitespace always
appear before EI?"

Is there something we can do to make this more robust? Should the EI
always end with a space?

Kind regards,

Martijn Brinkers


