[iText-questions] Euro symbol in PdfDoc encoding vs. non breaking space in unicode...

Ludger Bünger Tue, 27 Nov 2007 10:22:35 -0800

Hello everyone,

I believe I have found a bug in iText. Since it is easy to blame others
while overlooking some crucial information, I leave it to the community
to judge whether this is actually a feature and I am simply mistaken.


Consider the following code:

outlines.set(bookmarkLevel, new PdfOutline(
        (PdfOutline) outlines.get(parentBookmarkLevel),
        destination,
        // outlineTitle
        "1.\u00a0Sammanfattning"
));


This creates a bookmark in the PDF that is displayed as
"1.€Sammanfattning".

Please note the Euro symbol in the bookmark, although the original string
contains \u00a0, which is the Unicode non-breaking space.


After some digging in the PDF Reference I found the following notes in
Appendix D, "Encoding":

1. In PDF 1.3, the euro character was added to the Adobe standard Latin
   character set. It is encoded as 200 in WinAnsiEncoding and 240 in
   PDFDocEncoding, assigning codes that were previously unused. Apple
   changed the Mac OS Latin-text encoding for code 333 from the currency
   character to the euro character. However, this incompatible change
   has not been reflected in PDF’s MacRomanEncoding, which continues to
   map code 333 to currency. If the euro character is desired, an
   encoding dictionary can be used to specify this single difference
   from MacRomanEncoding.
[...]
6. The space character is also encoded as 312 in MacRomanEncoding and as
   240 in WinAnsiEncoding. The meaning of this duplicate code is
   “nonbreaking space,” but it is typographically the same as space.

As we can easily see, PDFDocEncoding maps octal 240 (hex A0) to the euro
symbol, as opposed to WinAnsiEncoding, which uses that code for the
non-breaking space just as Unicode does.
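To make the arithmetic concrete, here is a minimal sketch that checks the code value of the non-breaking space. It does not use iText at all; ISO-8859-1 is used only as a stand-in charset that the JDK guarantees and that agrees with Unicode for this code point, and the class name NbspCodeCheck is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class NbspCodeCheck {
    // Returns the single byte that ISO-8859-1 uses for U+00A0, as an unsigned value.
    static int latin1ByteOfNbsp() {
        byte[] encoded = "\u00a0".getBytes(StandardCharsets.ISO_8859_1);
        return encoded[0] & 0xFF;
    }

    public static void main(String[] args) {
        int code = latin1ByteOfNbsp();
        System.out.println(Integer.toOctalString(code)); // prints 240
        System.out.println(Integer.toHexString(code));   // prints a0
    }
}
```

So a byte with value octal 240 is exactly what the PDF Reference note assigns to the euro in PDFDocEncoding, which is why the two characters can be confused.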

Forcing Unicode did not help either: new
PdfString("1.\u00a0Sammanfattning", PdfObject.TEXT_UNICODE) made no
difference.


After some debugging I found the following lines in PdfString.getBytes():

if (encoding != null && encoding.equals(TEXT_UNICODE)
        && PdfEncodings.isPdfDocEncoding(value))
    bytes = PdfEncodings.convertToBytes(value, TEXT_PDFDOCENCODING);
else
    bytes = PdfEncodings.convertToBytes(value, encoding);

I made two observations:
1) As long as PdfEncodings.isPdfDocEncoding() believes that all
characters belong to PDFDocEncoding, PdfString will never use a
Unicode-encoded text string but always a PDFDocEncoded one.
But since there is no non-breaking space in PDFDocEncoding,
isPdfDocEncoding() should not match in the first place.

2) Nonetheless, even when converting to PDFDocEncoding, the convert
method should not create a € symbol out of thin air.


Regarding issue 1:
The following lines are found in PdfEncodings.isPdfDocEncoding():

// wrongly treats the non-breaking space (decimal 160) as part of
// PDFDocEncoding
if (char1 < 128 || (char1 >= 160 && char1 <= 255))
    continue;
if (!pdfEncoding.containsKey(char1))
    return false;

If we replace the first condition as follows:

// correctly treats the non-breaking space (decimal 160) as not being
// part of PDFDocEncoding
if (char1 < 128 || (char1 > 160 && char1 <= 255))
    continue;
if (!pdfEncoding.containsKey(char1))
    return false;

Then isPdfDocEncoding() detects non breaking spaces correctly.
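The difference between the two conditions can be tried out in isolation. The following is only a sketch, not iText's actual code: the class name, the method names and the one-entry PDF_ENCODING map are made up for illustration, and the map stands in for iText's pdfEncoding table, of which only the euro entry (octal 240, per the PDF Reference note quoted earlier) is modelled.

```java
import java.util.HashMap;
import java.util.Map;

class IsPdfDocEncodingSketch {
    // Hypothetical one-entry excerpt of the Unicode -> PDFDocEncoding table.
    static final Map<Character, Integer> PDF_ENCODING = new HashMap<>();
    static {
        PDF_ENCODING.put('\u20ac', 0xA0); // euro sign -> octal 240
    }

    // Range check as quoted from the original code (>= 160).
    static boolean isPdfDocEncodingOriginal(String text) {
        for (int k = 0; k < text.length(); k++) {
            char char1 = text.charAt(k);
            if (char1 < 128 || (char1 >= 160 && char1 <= 255))
                continue;
            if (!PDF_ENCODING.containsKey(char1))
                return false;
        }
        return true;
    }

    // Range check with the proposed change (> 160).
    static boolean isPdfDocEncodingProposed(String text) {
        for (int k = 0; k < text.length(); k++) {
            char char1 = text.charAt(k);
            if (char1 < 128 || (char1 > 160 && char1 <= 255))
                continue;
            if (!PDF_ENCODING.containsKey(char1))
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String title = "1.\u00a0Sammanfattning";
        System.out.println(isPdfDocEncodingOriginal(title)); // prints true
        System.out.println(isPdfDocEncodingProposed(title)); // prints false
    }
}
```

With the proposed check the bookmark title is no longer reported as PDFDocEncoding-compatible, so the TEXT_UNICODE branch in PdfString.getBytes() would presumably be taken.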

Regarding issue 2:

The following code is found in PdfEncodings.convertToBytes(String, String):

char char1 = cc[k];
if (char1 < 128 || (char1 >= 160 && char1 <= 255))
    c = char1;
else
    c = hash.get(char1);

As we can see, char 160 is taken "as is", which is what produces the
euro symbol and thus a wrong conversion.
If 160 is excluded from the range and the character is looked up in the
given hash instead, then at least iText will not convert wrongly but
omit the non-convertible character.
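The effect on the generated bytes can be sketched in the same way. Again this is an illustration under assumptions and not the iText implementation: the class, the convert() helper with its excludeNbsp switch and the one-entry PDF_ENCODING map are hypothetical, and characters without a mapping are simply dropped, as described above.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

class ConvertToBytesSketch {
    // Hypothetical one-entry excerpt of the Unicode -> PDFDocEncoding table.
    static final Map<Character, Integer> PDF_ENCODING = new HashMap<>();
    static {
        PDF_ENCODING.put('\u20ac', 0xA0); // euro sign -> octal 240
    }

    // excludeNbsp == false mimics the quoted check (>= 160),
    // excludeNbsp == true mimics the proposed check (> 160).
    static byte[] convert(String text, boolean excludeNbsp) {
        int lowerBound = excludeNbsp ? 161 : 160;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char char1 : text.toCharArray()) {
            Integer c;
            if (char1 < 128 || (char1 >= lowerBound && char1 <= 255))
                c = (int) char1;              // taken "as is"
            else
                c = PDF_ENCODING.get(char1);  // null if not convertible
            if (c != null)
                out.write(c);                 // unmapped characters are omitted
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String title = "1.\u00a0S";
        System.out.println(convert(title, false).length); // prints 4
        System.out.println(convert(title, true).length);  // prints 3
    }
}
```

In the first case the third byte is 0xA0, which a viewer interprets as the euro in PDFDocEncoding; in the second case the non-breaking space is left out entirely.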

So, having found two places where iText (imho) wrongly checks for
>= 160 where it should check for > 160, I venture the assumption that the
third place where it compares >= 160 should probably also be > 160.

Any comments/remarks/flames?

Best regards,

Ludger

-- 
Dipl-Inf. Ludger Bünger
Product Development
Team Martha
- - - - - - - - - - - - - - - -
RealObjects GmbH
Altenkesseler Str. 17/B4
66115 Saarbrücken, Germany
Tel +49 (0)681 98579 0
Fax +49 (0)681 98579 29
http://www.realobjects.com
[EMAIL PROTECTED]
- - - - - - - - - - - - - - - -
Commercial Register: Amtsgericht Saarbrücken, HRB 12016
Managing Directors: Michael Jung, Markus Neurohr
VAT-ID: DE210373115

