Looks like you found a bug.

Paulo
----- Original Message -----
From: "Ludger Bünger" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, November 27, 2007 6:22 PM
Subject: [iText-questions] Euro symbol in PdfDoc encoding vs. non-breaking space in unicode...

> Hello everyone,
>
> I believe I have found a bug in iText, but since it is easy to blame
> others while overlooking some crucial information, I leave it to the
> community to judge whether this is actually a feature and I am simply
> mistaken.
>
> Consider the following code:
>
>     outlines.set(bookmarkLevel, new PdfOutline(
>         (PdfOutline) outlines.get(parentBookmarkLevel),
>         destination,
>         // outlineTitle
>         "1.\u00a0Sammanfattning"
>     ));
>
> This creates a bookmark in the PDF with the following appearance:
> "1.€Sammanfattning".
>
> Please note the Euro symbol displayed in the bookmark, while the
> original contains \u00a0, which is the Unicode non-breaking space.
>
> After a short dig into the PDF Reference I found the following in
> Appendix D, "Encoding":
>
> 1. In PDF 1.3, the euro character was added to the Adobe standard
>    Latin character set. It is encoded as 200 in WinAnsiEncoding and
>    240 in PDFDocEncoding, assigning codes that were previously unused.
>    Apple changed the Mac OS Latin-text encoding for code 333 from the
>    currency character to the euro character. However, this
>    incompatible change has not been reflected in PDF’s
>    MacRomanEncoding, which continues to map code 333 to currency. If
>    the euro character is desired, an encoding dictionary can be used
>    to specify this single difference from MacRomanEncoding.
> [...]
> 6. The space character is also encoded as 312 in MacRomanEncoding and
>    as 240 in WinAnsiEncoding. The meaning of this duplicate code is
>    “nonbreaking space,” but it is typographically the same as space.
>
> As we can easily see, PDFDocEncoding maps octal 240 (hex A0, decimal
> 160) to the euro symbol, as opposed to WinAnsiEncoding, which follows
> Unicode and uses that code for the non-breaking space.
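The codes in the excerpt are octal, which is easy to misread next to the decimal 160 that appears in the iText source. A minimal Java sketch using only the JDK converts the quoted codes and shows the byte that ISO-8859-1, which mirrors the first 256 Unicode code points, assigns to U+00A0. The class and method names are made up for illustration and are not iText API.

```java
import java.nio.charset.StandardCharsets;

// Standalone sketch; OctalCodeCheck and its methods are hypothetical names, not iText API.
final class OctalCodeCheck {

    // Parses a code as printed in the PDF Reference encoding tables (octal digits).
    static int fromOctal(String octalDigits) {
        return Integer.parseInt(octalDigits, 8);
    }

    // Returns the single byte that ISO-8859-1 assigns to the given character.
    static int latin1Byte(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1)[0] & 0xFF;
    }

    public static void main(String[] args) {
        // Code quoted for the euro in PDFDocEncoding and for the
        // non-breaking space in WinAnsiEncoding.
        System.out.printf("octal 240 = %d = 0x%X%n", fromOctal("240"), fromOctal("240"));
        // Code quoted for the euro in WinAnsiEncoding.
        System.out.printf("octal 200 = %d = 0x%X%n", fromOctal("200"), fromOctal("200"));
        // The Unicode non-breaking space, seen as a single Latin-1 byte.
        System.out.printf("non-breaking space = 0x%X%n", latin1Byte((char) 0xA0));
    }
}
```

Octal 240 and the Latin-1 byte for the non-breaking space are the same value, decimal 160, which is why a character passed through unchanged ends up on the euro slot of PDFDocEncoding.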
>
> Forcing Unicode did not help either:
> new PdfString("1.\u00a0Sammanfattning", PdfObject.TEXT_UNICODE)
> made no difference.
>
> After some debugging I found the following lines in
> PdfString.getBytes():
>
>     if (encoding != null && encoding.equals(TEXT_UNICODE) &&
>             PdfEncodings.isPdfDocEncoding(value))
>         bytes = PdfEncodings.convertToBytes(value, TEXT_PDFDOCENCODING);
>     else
>         bytes = PdfEncodings.convertToBytes(value, encoding);
>
> I made two observations:
>
> 1) As long as PdfEncodings.isPdfDocEncoding() believes that all
>    characters belong to PdfDocEncoding, the PdfString will never use a
>    Unicode-encoded text string but always a PdfDocEncoded one. But
>    since there is no non-breaking space in PdfDocEncoding,
>    isPdfDocEncoding() should not match in the first place.
>
> 2) Nonetheless, even when converting to PdfDocEncoding, the convert
>    method should not create a € symbol out of thin air.
>
> Regarding issue 1:
>
> The following lines are found in PdfEncodings.isPdfDocEncoding():
>
>     // wrongly treats the non-breaking space (dec 160) as part of PdfDocEncoding
>     if (char1 < 128 || (char1 >= 160 && char1 <= 255))
>         continue;
>     if (!pdfEncoding.containsKey(char1))
>         return false;
>
> If we replace these lines with the following:
>
>     // correctly treats the non-breaking space (dec 160) as not part of PdfDocEncoding
>     if (char1 < 128 || (char1 > 160 && char1 <= 255))
>         continue;
>     if (!pdfEncoding.containsKey(char1))
>         return false;
>
> then isPdfDocEncoding() detects non-breaking spaces correctly.
>
> Regarding issue 2:
>
> The following code is found in PdfEncodings.convertToBytes(String,
> String):
>
>     char char1 = cc[k];
>     if (char1 < 128 || (char1 >= 160 && char1 <= 255))
>         c = char1;
>     else
>         c = hash.get(char1);
>
> As we can see, char 160 is taken "as is", which is what produces the
> euro symbol, so the conversion is wrong.
> If 160 is excluded from the range and the character is looked up in
> the given hash instead, then at least iText will not convert it
> wrongly but will omit the non-convertible character.
>
> So, having found two places where iText (imho) wrongly checks for
> >= 160 where it should check for > 160, I venture the guess that the
> third place where the code compares >= 160 should also use > 160.
>
> Any comments/remarks/flames?
>
> Best regards,
>
> Ludger
>
> --
> Dipl-Inf. Ludger Bünger
> Product Development
> Team Martha
> - - - - - - - - - - - - - - - -
> RealObjects GmbH
> Altenkesseler Str. 17/B4
> 66115 Saarbrücken, Germany
> Tel +49 (0)681 98579 0
> Fax +49 (0)681 98579 29
> http://www.realobjects.com
> [EMAIL PROTECTED]
> - - - - - - - - - - - - - - - -
> Commercial Register: Amtsgericht Saarbrücken, HRB 12016
> Managing Directors: Michael Jung, Markus Neurohr
> VAT-ID: DE210373115
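To make the two proposed changes concrete, here is a simplified, standalone model of the range check and the pass-through conversion discussed in the mail. It is not iText's code: the class name, the method names, and the one-entry lookup table are assumptions for illustration, and dropping unmapped characters follows the behaviour described in the mail rather than anything verified against the library.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the checks quoted in the mail; hypothetical names, not iText API.
final class PdfDocRangeSketch {

    // Stand-in for the Unicode-to-PDFDocEncoding lookup. Only the euro sign is
    // listed, placed at octal 240 (0xA0) as stated in the quoted PDF Reference excerpt.
    static final Map<Character, Integer> SPECIAL = new HashMap<>();
    static {
        SPECIAL.put((char) 0x20AC, 0xA0);
    }

    // Check as quoted: 160..255 counts as directly encodable.
    static boolean fitsInclusive(String s) { return fits(s, 160); }

    // Check as proposed: 161..255 counts as directly encodable.
    static boolean fitsExclusive(String s) { return fits(s, 161); }

    private static boolean fits(String s, int firstDirect) {
        for (char c : s.toCharArray()) {
            if (c < 128 || (c >= firstDirect && c <= 255)) continue;
            if (!SPECIAL.containsKey(c)) return false;
        }
        return true;
    }

    // Conversion with pass-through starting at firstDirect. Characters with no
    // mapping are dropped (an assumption based on the description in the mail).
    static byte[] convert(String s, int firstDirect) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char c : s.toCharArray()) {
            if (c < 128 || (c >= firstDirect && c <= 255)) {
                out.write(c); // taken as is; only the low 8 bits are written
            } else {
                Integer mapped = SPECIAL.get(c);
                if (mapped != null) out.write(mapped);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String title = "1." + (char) 0xA0 + "Sammanfattning";
        System.out.println("inclusive check accepts title: " + fitsInclusive(title));
        System.out.println("exclusive check accepts title: " + fitsExclusive(title));
        System.out.println("bytes with >= 160: " + convert(title, 160).length);
        System.out.println("bytes with >  160: " + convert(title, 161).length);
    }
}
```

In this model the inclusive bound accepts the bookmark title and writes byte 0xA0, which PDFDocEncoding renders as the euro sign. The exclusive bound rejects the title in the check, so a Unicode text string would be the natural fallback, and if the conversion is forced anyway the non-breaking space is dropped instead of being turned into a different character.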
