On 6/14/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).

As the word "character" is ambiguous, I'd put it this way:

- code point: the smallest unit Unicode deals with that's independent
  of encoding. Takes values in range(0, 0x110000)
- grapheme (or "grapheme cluster"): what users think of as characters.
  May consist of multiple code points, e.g. "ö" can be represented
  with one or two code points. Depends on the language the user speaks

> It sounds like we really use code units, not code points (except when
> building with the 4-byte Unicode option, when they are equivalent).

Not quite equivalent in current Python. From some past discussions I
thought this was by design, but now having seen this odd behavior,
maybe it isn't:

>>> sys.maxunicode
1114111
>>> x = u'\ud840\udc21'
>>> marshal.loads(marshal.dumps(x)) == x
False
>>> pickle.loads(pickle.dumps(x, 2)) == x
False
>>> pickle.loads(pickle.dumps(x, 1)) == x
False
>>> pickle.loads(pickle.dumps(x)) == x
True
>>>

Pickling should work the same way regardless of protocol, right? And
probably should not modify the objects it pickles if it can help it.

The reason the above happens is that binary pickles use UTF-8 to
encode unicode, and this is what happens with codecs:

>>> u'\ud840\udc21' == u'\U00020021'
False
>>> u'\ud840\udc21'.encode('utf-8').decode('utf-8')
u'\U00020021'
>>> u'\ud840\udc21'.encode('punycode').decode('punycode')
u'\ud840\udc21'
>>> u'\ud840\udc21'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\U00020021'.encode('utf-16').decode('utf-16')
u'\U00020021'
>>> u'\ud840\udc21'.encode('big5hkscs').decode('big5hkscs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'big5hkscs' codec can't encode character u'\ud840' in position 0: illegal multibyte sequence
>>> u'\U00020021'.encode('big5hkscs').decode('big5hkscs')
u'\U00020021'
>>>

Should codecs treat u'\ud840\udc21' and u'\U00020021' the same even on
UCS-4 builds (like current UTF-8 and UTF-16 codecs do) or not (like
current punycode and big5hkscs codecs do)?
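
For illustration, here is a minimal sketch of the two ideas above: one grapheme spelled with one or two code points, and the arithmetic that maps the surrogate pair U+D840 U+DC21 to the code point U+20021. It is written for Python 3, which is an assumption (the sessions above are from a Python 2 UCS-4 build, whose codecs behave differently). The helper name combine_surrogates is hypothetical and is not part of any library.

```python
import unicodedata


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it encodes.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# One grapheme, two spellings: precomposed (NFC) or base letter + combining mark (NFD).
composed = unicodedata.normalize("NFC", "o\u0308")
decomposed = unicodedata.normalize("NFD", "\u00f6")
assert len(composed) == 1 and len(decomposed) == 2

# The pair used in the sessions above corresponds to U+20021.
assert combine_surrogates(0xD840, 0xDC21) == 0x20021

# In Python 3 the two-surrogate string and the single code point are still
# different strings, and the strict UTF-8 codec refuses lone surrogates.
pair = "\ud840\udc21"
single = "\U00020021"
assert pair != single
try:
    pair.encode("utf-8")
    strict_utf8_accepts_pair = True
except UnicodeEncodeError:
    strict_utf8_accepts_pair = False

# With the 'surrogatepass' error handler the pair is written as two UTF-16
# code units, which then decode as the single code point.
round_tripped = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
assert round_tripped == single
```

The last step shows the same merging effect that the UTF-16 round trip above demonstrates, but only when the caller explicitly opts in through the error handler.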