Marc-Andre Lemburg <[EMAIL PROTECTED]> added the comment:

On 2008-08-29 23:33, Terry J. Reedy wrote:
> Terry J. Reedy <[EMAIL PROTECTED]> added the comment:
>
> "Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16
> vs. UTF-32)"
>
> I recently read most of the Unicode 5 standard and as near as I could
> tell it no longer uses the term UCS, if it ever did.

UCS2 and UCS4 are terms that stem from the versions of Unicode that were current when Unicode support was added to Python, i.e. in the year 2000, when ISO 10646 and the Unicode spec co-existed. See http://en.wikipedia.org/wiki/Universal_Character_Set for details.

UTF-16 is a transfer encoding that is based on UCS2 and adds surrogate pair interpretations. UTF-32 is the same for UCS4, but also restricts the range of valid code points to the range covered by UTF-16.

Whether surrogates are supported, and how they are supported, depends entirely on the codecs you use to convert the internal format to some encoding.

> "If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows.
> It'd be a pair of ill-formed code units instead."

You are mixing up the internal representation of Unicode code points with the result of passing those values through one of the codecs. For example, the unicode-escape codec is responsible for converting between the string representation u'\U00010123' and the internal representation.

Also note that because Python can be built using two different internal representations, the results of the codecs may vary depending on the platform.

BTW: There's no such thing as an ill-formed code unit. What you probably mean is an "ill-formed code unit sequence". However, that term refers to the output or accepted input values of a codec, not to the internal representation.

Please also note that because Python can be used both to build valid Unicode encoding data and to parse possibly invalid data, it has to be able to work with Unicode code points regardless of whether they can be interpreted as lone surrogates or not (hence the usage of the terms UCS2/UCS4, which don't support surrogates).

Whether the codecs should raise exceptions and possibly let an error handler decide whether or not to accept and/or generate ill-formed code unit sequences is another question.

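To make the distinction between code points and codec output concrete, here is a minimal sketch. It assumes Python 3.4 or later, where the internal representation is no longer fixed at build time and the "surrogatepass" error handler is available for the UTF-16 codec; it is not the 2008-era build discussed in this thread, on which the length of the string itself would differ between UCS2 and UCS4 builds. The variable names are illustrative only.

```python
import struct

# A single code point outside the Basic Multilingual Plane.
s = "\U00010123"

# unicode-escape converts between code points and the escaped text form.
escaped = s.encode("unicode-escape")
roundtrip = escaped.decode("unicode-escape")

# The UTF-16 codec emits a surrogate pair: two 16-bit code units.
units = struct.unpack(">2H", s.encode("utf-16-be"))

# A lone surrogate can be held as a code point in the string ...
lone = "\ud800"

# ... but the strict UTF-16 codec refuses to emit it,
try:
    lone.encode("utf-16-be")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ... while an error handler can decide to let it through.
passed = lone.encode("utf-16-be", errors="surrogatepass")

print(escaped, [hex(u) for u in units], strict_failed, passed)
```

The string holds code points; whether the result is a well-formed or ill-formed code unit sequence is only decided once a codec and its error handler are involved.
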
I hope that clears up the reasoning for using UCS2/UCS4 rather than UTF-16/UTF-32 when referring to the internal Unicode representation of Python.

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3297>
_______________________________________