On May 10, 2005, at 2:48 PM, Nicholas Bastin wrote:

> On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
>
>>> Wow, what an inane way of looking at it. I don't know what world you
>>> live in, but in my world, users read the configure options and suppose
>>> that they mean something. In fact, they *have* to go off on their own
>>> to assume something, because even the documentation you refer to above
>>> doesn't say what happens if they choose UCS-2 or UCS-4. A logical
>>> assumption would be that python would use those CEFs internally, and
>>> that would be incorrect.
>>
>> Certainly. That's why the documentation should be improved. Changing
>> the option breaks existing packaging systems, and should not be done
>> lightly.
>
> I'm perfectly happy to continue supporting --enable-unicode=ucs2,
> but not displaying it as an option. Is that acceptable to you?
If you're going to call python's implementation UTF-16, I'd consider
all these very serious deficiencies:

- unicodedata doesn't work for 2-char strings containing a surrogate
  pair, nor for integers. Therefore it is impossible to get any data
  on chars > 0xFFFF.
- There are no methods for determining whether something is a
  surrogate pair and turning it into an integer codepoint.
- Given that unicodedata doesn't work, I doubt that .upper()/etc. work
  right on surrogate pairs either, although I haven't tested.
- As has been noted before, the regexp engine doesn't properly treat
  a surrogate pair as a single unit.
- Is there a method that is like unichr but that will work for
  codepoints > 0xFFFF?

I'm sure there's more as well. I think it's a mistake to consider
python to be implementing UTF-16 just because it properly
encodes/decodes surrogate pairs in the UTF-8 codec.

James
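
To make the surrogate-pair points above concrete, here is a minimal sketch of the helpers I mean: one to test for a surrogate pair, one to turn a pair into an integer codepoint, and one to go the other way. The function names are hypothetical (nothing like them is claimed to exist in the standard library), and the code is written against Python 3 syntax; on a narrow 2.x build the same arithmetic would use unichr in place of chr.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (lead) surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trail) surrogate range


def is_surrogate_pair(s):
    """Hypothetical helper: True if s is a high surrogate followed by a low one."""
    return (len(s) == 2
            and HIGH_START <= ord(s[0]) <= HIGH_END
            and LOW_START <= ord(s[1]) <= LOW_END)


def pair_to_codepoint(s):
    """Hypothetical helper: combine a surrogate pair into an integer codepoint."""
    if not is_surrogate_pair(s):
        raise ValueError("not a surrogate pair: %r" % (s,))
    high, low = ord(s[0]), ord(s[1])
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def codepoint_to_pair(cp):
    """Hypothetical helper: split a codepoint above 0xFFFF into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint outside the supplementary planes: %#x" % cp)
    offset = cp - 0x10000
    return chr(HIGH_START + (offset >> 10)) + chr(LOW_START + (offset & 0x3FF))
```

With something like this, pair_to_codepoint(codepoint_to_pair(cp)) round-trips for any cp from 0x10000 through 0x10FFFF, which is the minimum I'd expect before calling the implementation UTF-16.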