Re: [Python-3000] String comparison

Rauli Ruohonen Thu, 14 Jun 2007 05:34:25 -0700

On 6/14/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).


As the word "character" is ambiguous, I'd put it this way:

- code point: the smallest unit Unicode deals with that's independent of
  encoding. Takes values in range(0, 0x110000).
- grapheme (or "grapheme cluster"): what users think of as a character. May
  consist of multiple code points, e.g. "ö" can be represented with one
  or two code points. What counts as a single grapheme depends on the
  language the user speaks.
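
To make the "ö" case concrete, here is a minimal sketch. It assumes
Python 3, where str is indexed by code point, rather than the Python 2
builds discussed in this thread; the variable names are only illustrative.

```python
import unicodedata

# One user-perceived character, two different code point sequences.
composed = "\u00f6"      # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS

# len() counts code points, not graphemes.
composed_len = len(composed)      # 1
decomposed_len = len(decomposed)  # 2

# The two strings compare unequal until they are normalized to the same form.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed
```

Both strings render as the same grapheme, yet a plain == sees two
different code point sequences.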

> It sounds like we really use code units, not code points (except when
> building with the 4-byte Unicode option, when they are equivalent).

Not quite equivalent in current Python, even on a UCS-4 build. From some
past discussions I thought this was by design, but having now seen this odd
behavior, maybe it isn't:

>>> import sys, marshal, pickle
>>> sys.maxunicode
1114111
>>> x = u'\ud840\udc21'
>>> marshal.loads(marshal.dumps(x)) == x
False
>>> pickle.loads(pickle.dumps(x, 2)) == x
False
>>> pickle.loads(pickle.dumps(x, 1)) == x
False
>>> pickle.loads(pickle.dumps(x)) == x
True
>>>

Pickling should work the same way regardless of protocol, right? And it
probably should not change the value of the objects it pickles if it can
help it. The reason the above happens is that binary pickles use UTF-8 to
encode unicode objects, and this is what happens with codecs:

>>> u'\ud840\udc21' == u'\U00020021'
False
>>> u'\ud840\udc21'.encode('utf-8').decode('utf-8')
u'\U00020021'
>>> u'\ud840\udc21'.encode('punycode').decode('punycode')
u'\ud840\udc21'
>>> u'\ud840\udc21'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\U00020021'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\ud840\udc21'.encode('big5hkscs').decode('big5hkscs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5hkscs' codec can't encode character u'\ud840' in position 0: illegal multibyte sequence
>>> u'\U00020021'.encode('big5hkscs').decode('big5hkscs')
u'\U00020021'
>>>

Should codecs treat u'\ud840\udc21' and u'\U00020021' the same even on
UCS-4 builds (like current UTF-8 and UTF-16 codecs do) or not (like current
punycode and big5hkscs codecs do)?
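
For reference, the mapping that the UTF-8 and UTF-16 codecs apply above is
the standard UTF-16 surrogate pair arithmetic. A minimal sketch, assuming
Python 3 (the 'surrogatepass' error handler does not exist in the Python 2
builds shown above), with combine_surrogates as a hypothetical helper name:

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = "\ud840\udc21"  # two surrogate code points stored in one str
combined = chr(combine_surrogates(ord(pair[0]), ord(pair[1])))

# Round-tripping through UTF-16 merges the pair the same way:
# 'surrogatepass' lets the surrogates be encoded as raw 16-bit units,
# and the decoder then reads them as one supplementary code point.
via_utf16 = pair.encode("utf-16", "surrogatepass").decode("utf-16")
```

Here combined and via_utf16 are both the single code point U+20021, while
pair itself is still a two-element string that compares unequal to it.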
