Shane Hathaway wrote:
> Martin v. Löwis wrote:
>
>>Shane Hathaway wrote:
>>
>>>I agree that UCS4 is needed. There is a balancing act here; UTF-16 is
>>>widely used and takes less space, while UCS4 is easier to treat as an
>>>array of characters. Maybe we can have both: unicode objects start with
>>>an internal representation in UTF-16, but get promoted automatically to
>>>UCS4 when you index or slice them. The difference will not be visible
>>>to Python code. A compile-time switch will not be necessary. What do
>>>you think?
>>
>>This breaks backwards compatibility with existing extension modules.
>>Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and
>>can use that to directly access the characters.
>
> Py_UNICODE would always be 32 bits wide. PyUnicode_AsUnicode would
> cause the unicode object to be promoted automatically. Extensions that
> break as a result are technically broken already, aren't they? They're
> not supposed to depend on the size of Py_UNICODE.

-1. You are free to compile Python with --enable-unicode=ucs4 if you
prefer this setting. I don't see any reason why we should force users
to invest 4 bytes of storage for each Unicode code point - 2 bytes work
just fine and can represent all Unicode characters that are currently
defined (using surrogates if necessary). As more and more Unicode
objects are used in a process, choosing UCS2 vs. UCS4 does make a huge
difference in terms of memory use.

All this talk about UTF-16 vs. UCS-2 is not very useful and strikes me
as purely academic.

The reference to possible breakage by slicing a Unicode object and
splitting a surrogate pair is valid, but the idea that UCS-4 is less
prone to breakage is a myth: Unicode has many code points that are
meant only for composition and don't have any standalone meaning, e.g.
a combining acute accent (U+0301), yet they are perfectly valid code
points - regardless of UCS-2 or UCS-4. It is easily possible to break
such a combining sequence using slicing, so the most often presented
argument for using UCS-4 instead of UCS-2 (+ surrogates) is rather weak
if seen by daylight.

Some may now say that combining sequences are not used all that often.
However, they play a central role in Unicode normalization
(http://www.unicode.org/reports/tr15/), which is needed whenever you
want to semantically compare Unicode objects and are

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 07 2005)
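
The combining-sequence and storage points above can be sketched in a few
lines. This is only an illustration, assuming a current Python 3
interpreter, where str is indexed by code point (unlike the 2005-era
narrow builds under discussion). The sample strings and variable names
are made up for the example.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one character
# but is two code points, whatever the internal storage width is.
decomposed = "cafe\u0301"
composed = "caf\u00e9"  # same text using the precomposed U+00E9

# Slicing by code point separates the base letter from its accent.
head, tail = decomposed[:4], decomposed[4:]
print(repr(head))                   # 'cafe'
print(unicodedata.name(tail))       # COMBINING ACUTE ACCENT
print(unicodedata.combining(tail))  # 230 (non-zero: a combining mark)

# Without normalization (UAX #15) the two spellings compare unequal.
print(decomposed == composed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Storage cost: 8 BMP code points plus U+1D11E, which lies outside the
# BMP and therefore needs a surrogate pair in UTF-16.
text = "Unicode \U0001D11E"
utf16_bytes = len(text.encode("utf-16-le"))
utf32_bytes = len(text.encode("utf-32-le"))
print(utf16_bytes, utf32_bytes)     # 20 36
```

The slice at index 4 is legal in either representation, yet it leaves an
accent with nothing to attach to; a wider code unit does not prevent
that.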
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev