On May 6, 2005, at 8:25 PM, Martin v. Löwis wrote:

> Nicholas Bastin wrote:
>> Yes.  Not only in my mind, but in the Python source code.  If
>> Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4),
>> otherwise the encoding is UTF-16 (*not* UCS-2).
>
> I see. Some people equate "encoding" with "encoding scheme";
> neither UTF-32 nor UTF-16 is an encoding scheme. You were

That's not true.  UTF-16 and UTF-32 are each both a CES (character
encoding scheme) and a CEF (character encoding form); UTF-16LE and
UTF-16BE are the ones that are only encoding schemes.  UTF-32 is a
fixed-width encoding form covering the code space (0..10FFFF), and
UTF-16 is a variable-width encoding form which represents each code
point in that same code space as one or two 16-bit code units.  You
are perhaps right to point out that people should be more explicit
about which of the two they are referring to.  UCS-2, on the other
hand, is only a CEF, so I thought it was obvious that I was referring
to UTF-16 as a CEF.  I would point anyone who is confused on this
point to Unicode Technical Report #17 on the Character Encoding
Model, which is much clearer than trying to piece the relevant parts
together out of the entire standard.

In any event, Python's use of the term UCS-2 is incorrect.  I quote
from the TR:

    "The UCS-2 encoding form, which is associated with ISO/IEC 10646
    and can only express characters in the BMP, is a fixed-width
    encoding form."

immediately followed by:

    "In contrast, UTF-16 uses either one or two code units and is
    able to cover the entire code space of Unicode."

If Python is capable of representing the entire code space of Unicode
when you choose --unicode=ucs2, then that is a bug.  It either should
not be called UCS-2, or the interpreter should be bound by the
limitations of the UCS-2 CEF.
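
To make that concrete, here is roughly what a --unicode=ucs2
interpreter actually does with a character outside the BMP (a sketch
from memory of a 2.x narrow build, so treat the exact output as
approximate):

    import sys

    print sys.maxunicode            # 65535 on a "ucs2" build, 1114111 on a "ucs4" build

    s = u"\U00010000"               # first code point outside the BMP
    print len(s)                    # 2 on a narrow build, 1 on a wide build
    print [hex(ord(c)) for c in s]  # ['0xd800', '0xdc00'] on a narrow build

Storing a surrogate pair like that is exactly what the UTF-16
encoding form does; a real UCS-2 implementation simply could not
represent the character at all.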
>> What I mean by 'variable' is that you can't make any assumption as
>> to what the size will be in any given python when you're writing
>> (and building) an extension module.  This breaks binary
>> compatibility of extension modules on the same platform and same
>> version of python across interpreters which may have been built
>> with different configure options.
>
> True. The breakage will be quite obvious, in most cases: the module
> fails to load because not only sizeof(Py_UNICODE) changes, but also
> the names of all symbols change.

Yes, but the important question here is: why would we want that?  Why
doesn't Python just have *one* internal representation of a Unicode
character?  Having more than one possible definition just creates
problems and provides no value.

--
Nick
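
P.S.  For anyone who hasn't been bitten by this yet, here is the sort
of pure-Python divergence the two representations produce (again a
from-memory sketch, so the exact reprs may vary slightly):

    # The same literal behaves differently depending on how the
    # interpreter was configured.
    s = u"\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the BMP

    print len(s)         # 1 on a --unicode=ucs4 build, 2 on a --unicode=ucs2 build
    print repr(s[0])     # the whole character, or just u'\ud834' (a lone high surrogate)
    print s[:1] == s     # True on a ucs4 build, False on a ucs2 build

Extension modules get it even worse, since sizeof(Py_UNICODE) and, as
Martin notes, the exported symbol names change along with it.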