My apologies for hammering on this, but I think it is quite important and currently Python 3.0 seems confused about UCS-2 versus UTF-16.
-On [20080702 20:47], Guido van Rossum ([EMAIL PROTECTED]) wrote:
>No, Python already is aware of surrogates. I meant applications
>processing non-BMP text should beware of them.

Just to make sure people are fully aware of the distinctions:

UCS-2 uses 16-bit code units to encode Unicode data, does NOT support
surrogate pairs and therefore CANNOT represent data beyond U+FFFF (thus only
supporting the Basic Multilingual Plane, BMP). It is a fixed-length character
encoding.

UTF-16 also uses 16-bit code units to encode Unicode data, but DOES support
surrogate pairs and therefore CAN represent data beyond U+FFFF by using said
surrogate pairs (thus supporting all planes). It is a variable-length
character encoding.

So a string representation in UCS-2 means every character occupies 16 bits. A
string representation in UTF-16 means a character can occupy either 16 bits
or 32 bits.

If one stays within the BMP then all is well, but when you move beyond the
BMP (U+10000 - U+10FFFF) Python needs to correctly check the string for
surrogate pairs and deal with them internally.

>If you find places where the Python core or standard library is doing
>Unicode processing that would break when surrogates are present you
>should file a bug. However this does not mean that every bit of code
>that slices a string at an arbitrary point (and hence risks slicing in
>the middle of a surrogate) is incorrect -- it all depends on what is
>done next with the slice.

Basically everything but string forming or string printing seems to be broken
for surrogate pairs, from what I can tell.

Also, I think you are confused about slicing in the middle of a surrogate
pair. From a UTF-16 perspective a surrogate pair is 1 codepoint! As such
Python needs to treat it as one character/codepoint in a string and deal with
slicing accordingly. The way you currently describe it, UTF-16 strings will
be treated as UCS-2 when it comes to slicing and the like.
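
To make the surrogate-pair arithmetic concrete, here is a minimal sketch of how a codepoint beyond U+FFFF maps to two 16-bit code units and back. The helper names to_surrogate_pair and from_surrogate_pair are my own for illustration and are not functions from the Python core; the constants are the standard UTF-16 surrogate ranges.

```python
def to_surrogate_pair(cp):
    """Split a codepoint above U+FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration, not part of the Python core.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint is inside the BMP or out of range")
    offset = cp - 0x10000            # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x20045)
print(hex(high), hex(low))                  # 0xd840 0xdc45
print(hex(from_surrogate_pair(high, low)))  # 0x20045
```

Neither half of the pair is a character on its own: cutting between 0xd840 and 0xdc45 leaves two lone surrogates that no longer decode to anything.
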
From a UTF-16 point of view such slicing can NEVER occur unless you are bit
or byte slicing instead of character/codepoint slicing.

The documentation for len() says:

  Return the length (the number of items) of an object.

I think it can fairly be said that an item in a string is a character or
codepoint. Take for example the following string:

  a = '\U00020045\u942a'  # Two hanzi/kanji/hanja

From a Unicode perspective we are looking at two characters/codepoints.

When we use a 4-byte Python 3.0 binary we get (as expected):

  >>> len(a)
  2

When we use a 2-byte Python 3.0 binary (the default) we get (not as
expected):

  >>> len(a)
  3

From a UTF-16 perspective a surrogate pair is one character/codepoint and as
such len() should have reported 2 as well. That the sequence is stored
internally as 0xd840 0xdc45 0x942a and occupies three 16-bit code units is
not interesting. But it seems as if len() is treating the string as being in
UCS-2 (fixed-length), which is the only logical explanation for the number 3,
instead of treating it as UTF-16 (variable-length) and reporting the number
2.

Subsequently doing a:

  print(a[1])

to get the 0x942a (鐪) actually requires a[2] on the 2-byte Python 3.0.

As such the code you write for 2-byte and 4-byte Python 3.0 is *different*
when you have to deal with the same Unicode strings! This cannot be the
desired situation, can it?

Two more examples:

  >>> a.find('鐪')  # 4-byte
  1
  >>> a.find('鐪')  # 2-byte
  2

  >>> import re  # 4-byte
  >>> m = re.search('鐪', a)
  >>> m.start()
  1

  >>> import re  # 2-byte
  >>> m = re.search('鐪', a)
  >>> m.start()
  2

This, in my opinion, has nothing to do with the application writers, but more
with Python's internals being confused about UCS-2 and UTF-16.

We accept codepoints beyond the BMP via the 32-bit \U escape in strings, and
we may even store them as UTF-16 internally, but we clearly do not deal with
them properly as UTF-16, but rather as UCS-2, when it comes to using said
strings with core functions and modules.
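
As a rough sketch of the behaviour I am arguing for, the following counts and indexes by codepoint over UTF-16 code units. The names utf16_units, codepoint_len and unit_index are hypothetical helpers of mine, not existing core functions, and the sketch assumes well-formed UTF-16 (no lone surrogates). It works on the encoded code units, so the result does not depend on how the interpreter stores strings.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_low_surrogate(unit):
    """True for the second unit of a surrogate pair (U+DC00..U+DFFF)."""
    return 0xDC00 <= unit <= 0xDFFF


def codepoint_len(units):
    """Count codepoints: every unit except a low surrogate starts one."""
    return sum(1 for u in units if not is_low_surrogate(u))


def unit_index(units, cp_index):
    """Map a codepoint index to the index of its first code unit."""
    seen = -1
    for i, u in enumerate(units):
        if not is_low_surrogate(u):
            seen += 1
            if seen == cp_index:
                return i
    raise IndexError("codepoint index out of range")


a = '\U00020045\u942a'
units = utf16_units(a)
print([hex(u) for u in units])  # ['0xd840', '0xdc45', '0x942a']
print(len(units))               # 3 code units
print(codepoint_len(units))     # 2 codepoints
print(unit_index(units, 1))     # 2: codepoint 1 starts at code unit 2
```

The linear scan in unit_index is the cost of treating UTF-16 as variable-length: indexing by codepoint is no longer O(1) unless some extra index is kept alongside the string.
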
-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
For wouldst thou not carve at my Soul with thine sword of Supreme Truth?