Hello,

2008/7/3 Guido van Rossum <[EMAIL PROTECTED]>:
> I don't see an answer there to the question of whether the length()
> method of a Java String object containing a single surrogate pair
> returns 1 or 2; I suspect it returns 2. Python 3 supports things like
> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
> unichr and unicode literals.)

Python 2.6 support for supplementary characters is not ideal:

>>> unichr(0x2f81a)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>>> ord(u'\U0002F81A')
TypeError: ord() expected a character, but string of length 2 found

\Uxxxxxxxx seems to be the only way to enter these characters.
3.0 is much better and passes the two tests above.

The unicodedata module gives good results in both versions:

>>> unicodedata.name(u'\U0002F81A')
'CJK COMPATIBILITY IDEOGRAPH-2F81A'
>>> unicodedata.category(u'\U0002F81A')
'Lo'

With Python 3.0, I found only two places that refuse large code points
on narrow builds: the "%c" format, and Py_BuildValue('C').
They should be fixed.

> The one thing that may be missing from Python is things like
> interpretation of surrogates by functions like isalpha() and I'm okay
> with adding that (since those have to loop over the entire string
> anyway).

In this case, a new .isascii() method would be needed for some uses.

--
Amaury Forgeot d'Arc
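
For reference, the narrow-build behaviour above comes from UTF-16 surrogate pairs: a code point above U+FFFF is stored as two code units, so the string has length 2. Below is a minimal sketch of that arithmetic. The helper names (to_surrogate_pair, from_surrogate_pair, wide_ord) are hypothetical and not part of Python; they only illustrate what a surrogate-aware ord() would have to do.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def wide_ord(s):
    """Hypothetical ord() that also accepts a single surrogate pair."""
    if len(s) == 2:
        return from_surrogate_pair(ord(s[0]), ord(s[1]))
    return ord(s)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x2F81A)
    print(hex(high), hex(low))
    print(hex(wide_ord("\ud87e\udc1a")))
```

On a narrow build, u'\U0002F81A' is held as exactly such a pair, which is why ord() reports a string of length 2 while unicodedata can still resolve the character.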