Re: [iText-questions] Euro symbol in PdfDoc encoding vs. non breakingspace in unicode...

Paulo Soares Tue, 27 Nov 2007 10:33:47 -0800

Looks like you found a bug.

Paulo


----- Original Message ----- 
From: "Ludger Bünger" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, November 27, 2007 6:22 PM
Subject: [iText-questions] Euro symbol in PdfDoc encoding vs. non breakingspace in unicode...


>
> Hello everyone,
>
> I believe I have found a bug in iText, but since it is easy to blame
> others while overlooking some crucial information, I leave it to the
> community to judge whether this is actually a feature and I am simply
> mistaken.
>
> Consider the following code:
>
> outlines.set(bookmarkLevel, new PdfOutline(
>     (PdfOutline) outlines.get(parentBookmarkLevel),
>     destination,
>     // outlineTitle
>     "1.\u00a0Sammanfattning"
> ));
>
>
> This creates a bookmark in the PDF with the following appearance:
> "1.€Sammanfattning".
>
> Please note the euro symbol displayed in the bookmark, whereas the
> original string contains \u00a0, the Unicode nonbreaking space.
>
>
> After a little digging in the PDF Reference, I found the following in
> Appendix D "Encoding":
>
> 1. In PDF 1.3, the euro character was added to the Adobe standard Latin
> character set. It is encoded as 200 in WinAnsiEncoding and 240 in
> PDFDocEncoding, assigning codes that were previously unused. Apple
> changed the Mac OS Latin-text encoding for code 333 from the currency
> character to the euro character. However, this incompatible change has
> not been reflected in PDF’s MacRomanEncoding, which continues to map
> code 333 to currency. If the euro character is desired, an encoding
> dictionary can be used to specify this single difference from
> MacRomanEncoding.
> [...]
> 6. The space character is also encoded as 312 in MacRomanEncoding and
> as 240 in WinAnsiEncoding. The meaning of this duplicate code is
> “nonbreaking space,” but it is typographically the same as space.
>
> As we can easily see, PDFDocEncoding maps octal 240 (hex A0) to the
> euro symbol, as opposed to WinAnsiEncoding, which follows Unicode here.
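
The codes in the quoted appendix are octal, which is easy to miss. A minimal sketch of the arithmetic, assuming nothing beyond the JDK (the class name OctalCheck is purely illustrative and not part of iText):

```java
final class OctalCheck {
    /** Parses a code written in octal, as in the quoted encoding notes. */
    static int fromOctal(String octal) {
        return Integer.parseInt(octal, 8);
    }

    public static void main(String[] args) {
        int code = fromOctal("240");
        // Octal 240 is decimal 160 (hex A0), the code point of U+00A0.
        System.out.println(code + " = 0x" + Integer.toHexString(code));
        System.out.println(code == '\u00a0');
    }
}
```

So the same numeric value means "nonbreaking space" in Unicode and WinAnsiEncoding but "euro" in PDFDocEncoding, which is why a character copied through unchanged turns into a different glyph.
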
>
> Forcing Unicode did not help either: new
> PdfString("1.\u00a0Sammanfattning", PdfObject.TEXT_UNICODE) made no
> difference.
>
>
> After some debugging I found the following lines in PdfString.getBytes():
>
> if (encoding != null && encoding.equals(TEXT_UNICODE)
>         && PdfEncodings.isPdfDocEncoding(value))
>     bytes = PdfEncodings.convertToBytes(value, TEXT_PDFDOCENCODING);
> else
>     bytes = PdfEncodings.convertToBytes(value, encoding);
>
> I made two observations:
> 1) As long as PdfEncodings.isPdfDocEncoding() believes that all
> characters belong to PDFDocEncoding, PdfString will never use a
> Unicode-encoded text string but always a PDFDocEncoded one.
> But since there is no nonbreaking space in PDFDocEncoding,
> isPdfDocEncoding() should not match in the first place.
>
> 2) Nonetheless, even when converting to PDFDocEncoding, the convert
> method should not conjure a € symbol out of nothing.
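
For reference, the alternative that observation 1 expects is a Unicode text string, which in PDF is UTF-16BE preceded by the byte order mark FE FF. A rough sketch of that byte layout using only the JDK (UnicodeTextStringSketch and toUnicodeTextString are hypothetical names, not iText API):

```java
import java.nio.charset.StandardCharsets;

final class UnicodeTextStringSketch {
    /** Returns FE FF followed by the UTF-16BE bytes of the text. */
    static byte[] toUnicodeTextString(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE; // byte order mark, high byte
        out[1] = (byte) 0xFF; // byte order mark, low byte
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }
}
```

In this form the nonbreaking space is written as the two bytes 00 A0, so a reader cannot confuse it with the single PDFDocEncoding byte A0 that stands for the euro.
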
>
>
> Regarding issue 1:
> The following lines are found in PdfEncodings.isPdfDocEncoding():
>
> // wrongly treats the nonbreaking space (dec 160) as part of PDFDocEncoding
> if (char1 < 128 || (char1 >= 160 && char1 <= 255))
>     continue;
> if (!pdfEncoding.containsKey(char1))
>     return false;
>
> If we replace these lines with the following:
>
> // correctly treats the nonbreaking space (dec 160) as not part of PDFDocEncoding
> if (char1 < 128 || (char1 > 160 && char1 <= 255))
>     continue;
> if (!pdfEncoding.containsKey(char1))
>     return false;
>
> Then isPdfDocEncoding() detects non breaking spaces correctly.
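
A self-contained sketch of the proposed check, so the effect can be tried without the iText sources. The EXTRA set is a hypothetical stand-in for iText's pdfEncoding table and lists only a few of the characters PDFDocEncoding defines outside Latin-1; the range test mirrors the quoted code with > 160:

```java
import java.util.HashSet;
import java.util.Set;

final class PdfDocCheckSketch {
    // Hypothetical excerpt, not the real iText table.
    private static final Set<Character> EXTRA = new HashSet<>();
    static {
        EXTRA.add('\u20ac'); // euro sign
        EXTRA.add('\u2022'); // bullet
        EXTRA.add('\u2014'); // em dash
    }

    static boolean isPdfDocEncoding(String text) {
        if (text == null) return true;
        for (int k = 0; k < text.length(); k++) {
            char c = text.charAt(k);
            // Strictly greater than 160, so U+00A0 is not waved through.
            if (c < 128 || (c > 160 && c <= 255)) continue;
            if (!EXTRA.contains(c)) return false;
        }
        return true;
    }
}
```

With this range, "1.\u00a0Sammanfattning" is reported as not representable, so a caller that branches on the result would fall through to the Unicode path.
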
>
> Regarding issue 2:
>
> The following code is found in PdfEncodings.convertToBytes(String, String):
>
> char char1 = cc[k];
> if (char1 < 128 || (char1 >= 160 && char1 <= 255))
>     c = char1;
> else
>     c = hash.get(char1);
>
> As we can see, char 160 is taken "as is", which is what produces the
> euro symbol and thus a wrong conversion.
> If 160 is excluded from the range and the char is looked up in the
> given hash instead, then at least iText will not convert wrongly but
> will omit the non-convertible character.
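
A minimal sketch of the conversion with the corrected range. TABLE is a hypothetical two-entry excerpt of a Unicode to PDFDocEncoding mapping, and toPdfDocBytes is an illustrative name rather than the iText method:

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

final class PdfDocConvertSketch {
    // Hypothetical excerpt, not the real iText table.
    private static final Map<Character, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put('\u20ac', 0xA0); // euro sign sits at octal 240
        TABLE.put('\u2022', 0x80); // bullet sits at octal 200
    }

    static byte[] toPdfDocBytes(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int k = 0; k < text.length(); k++) {
            char c = text.charAt(k);
            if (c < 128 || (c > 160 && c <= 255)) {
                out.write(c); // same code in Unicode and PDFDocEncoding
            } else {
                Integer b = TABLE.get(c);
                if (b != null) out.write(b); // unmapped characters are dropped
            }
        }
        return out.toByteArray();
    }
}
```

Here U+00A0 no longer becomes the byte A0; it is simply omitted, while a real euro sign still maps to A0.
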
>
> So, having found two places where iText (IMHO) wrongly checks for
> ">= 160" where it should check for "> 160", I venture the assumption
> that the third place where it compares ">= 160" should also be "> 160".
>
> Any comments/remarks/flames?
>
> Best regards,
>
> Ludger
>
> -- 
> Dipl-Inf. Ludger Bünger
> Product Development
> Team Martha
> - - - - - - - - - - - - - - - -
> RealObjects GmbH
> Altenkesseler Str. 17/B4
> 66115 Saarbrücken, Germany
> Tel +49 (0)681 98579 0
> Fax +49 (0)681 98579 29
> http://www.realobjects.com
> [EMAIL PROTECTED]
> - - - - - - - - - - - - - - - -
> Commercial Register: Amtsgericht Saarbrücken, HRB 12016
> Managing Directors: Michael Jung, Markus Neurohr
> VAT-ID: DE210373115


