On 27 Jun, 2010, at 11:48, Greg Ewing wrote:

> Stefan Behnel wrote:
>> Greg Ewing, 26.06.2010 09:58:
>>> Would there be any sanity in having an option to compile
>>> Python with UTF-8 as the internal string representation?
>> It would break Py_UNICODE, because the internal size of a unicode character
>> would no longer be fixed.
>
> It's not fixed anyway with the 2-char build -- some
> characters are represented using a pair of surrogates.

It is, for practical purposes, not even fixed in 4-char builds. In 4-char builds every Unicode code point corresponds to one item in a Python unicode string, but a base character followed by combining characters is still a sequence of code points, and should IMHO almost always be treated as a single object.

As an example, given s="be\N{COMBINING DIAERESIS}", s[:2] or s[2:] is almost certainly semantically invalid, because either slice separates the "e" from its diaeresis.

Ronald
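
P.S. To make the example concrete, here is a minimal sketch, assuming Python 3 string semantics (where every item of a str is one code point, as in a 4-char build). The helper name split_clusters is hypothetical, and it only attaches combining marks to the preceding character; it is not a full implementation of Unicode grapheme cluster segmentation.

```python
import unicodedata


def split_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified illustration only: it relies on the canonical combining
    class and ignores the other grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


s = "be\N{COMBINING DIAERESIS}"

# Three code points, but only two units a reader would call characters.
print(len(s))
print(len(split_clusters(s)))

# Slicing by code point index cuts between the "e" and its diaeresis.
head, tail = s[:2], s[2:]
print(head == "be")
print(unicodedata.name(tail))
```

Slicing on the list returned by split_clusters, rather than on s itself, keeps the "e" and its diaeresis together.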