At Friday 29/9/2006 04:52, Lawrence D'Oliveiro wrote:

> >> Is there a way to calculate in characters
> >> and not in bytes to represent the characters.
> >
> > Decode the byte string and use `len()` on the unicode string.
>
>Hmmm, for some reason
>
>     len(u"C\u0327")
>
>returns 2.
That's correct: these are two Unicode code points, C (U+0043) and combining cedilla (U+0327), which are displayed together as a single Ç. `len()` counts code points, not the glyphs you see on screen.

From <http://en.wikipedia.org/wiki/Unicode>:

"Unicode takes the role of providing a unique code point (a number, not a glyph) for each character. In other words, Unicode represents a character in an abstract way, and leaves the visual rendering (size, shape, font or style) to other software [...]

This simple aim becomes complicated, however, by concessions made by Unicode's designers, in the hope of encouraging a more rapid adoption of Unicode. [...] A lot of essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. [...]

Also, while Unicode allows for combining characters, it also contains precomposed versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler and allow applications to use Unicode as an internal text format without having to implement combining characters. For example é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute) but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute)."

Gabriel Genellina
Softlab SRL
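
P.S. A minimal sketch of the two representations described above, using the standard `unicodedata` module. The helper name `visible_length` is made up for illustration; it simply skips combining marks, which is only a rough approximation of "characters as displayed" and will not be right for every script.

```python
import unicodedata

# Decomposed form: LATIN CAPITAL LETTER C followed by COMBINING CEDILLA.
decomposed = u"C\u0327"

# NFC normalization composes the pair into the precomposed U+00C7 where one exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 2 code points
print(len(composed))    # 1 code point


def visible_length(text):
    """Hypothetical helper: count code points that are not combining marks.

    unicodedata.combining() returns 0 for non-combining characters, so this
    counts base characters only. It is an approximation, not a full
    grapheme-cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(visible_length(decomposed))   # 1
print(visible_length(u"e\u0301"))   # 1 (e + combining acute)
```

Normalizing to NFC before calling `len()` works when a precomposed character exists, as it does for Ç and é; sequences with no precomposed form keep their combining marks after normalization, which is why the sketch also shows the count that skips them.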