On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
>
> > There's no UTF-8 in Python's internal string encoding.  What are you
> > talking about?
>
> (At least as of a few days ago)
>
> In Python 3 there is; strings are unicode.  A PyUnicodeObject object
> has two encodings that you can grab from a pointer (which means they
> have to be there; you don't have time to generate them like you would
> with a function pointer).

Incorrect. The pointer can be NULL. The API for getting the UTF-8
encoding is a function (moreover, a function whose name starts with _Py).

> One of these (str) is the "internal encoding" which is chosen at
> compile time, and the other (defenc) is now hard-coded to UTF-8.
>
> Hashing is also based on the UTF-8 bytestring.

Not any more as of a few hours ago; the hashing based on UTF-8 was
excessively expensive, and I rewrote it to directly use the code
units(?) (or whatever they are called -- the Py_UNICODE values). For
strings containing no code points >= 2**16 this will give the same
value on all platforms; if there are code points >= 2**16, results
vary, since these will be represented as surrogates on 2-byte systems
but not on 4-byte systems.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
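
The surrogate point in the message above can be illustrated with a small
sketch in present-day Python. It is an illustration under stated
assumptions, not the interpreter's code: the UTF-16 and UTF-32 codecs
stand in for the Py_UNICODE arrays of a 2-byte and a 4-byte build, and
code_units() and toy_hash() are hypothetical helpers (toy_hash is not
CPython's hash algorithm; it only shows that a hash computed over code
units depends on how the string is split into units).

```python
import struct


def code_units(text, width):
    """Return the integer code units of `text` for a 2- or 4-byte unit width.

    Assumption: UTF-16 / UTF-32 (little-endian, no BOM) emulate what a
    narrow / wide build would store as Py_UNICODE values.
    """
    if width == 2:
        raw = text.encode("utf-16-le")
        fmt = "<%dH" % (len(raw) // 2)
    elif width == 4:
        raw = text.encode("utf-32-le")
        fmt = "<%dI" % (len(raw) // 4)
    else:
        raise ValueError("width must be 2 or 4")
    return list(struct.unpack(fmt, raw))


def toy_hash(units):
    """Hypothetical multiply-and-xor hash over code units (illustration only)."""
    h = 0
    for u in units:
        h = ((h * 1000003) ^ u) & 0xFFFFFFFF
    return h


bmp = "caf\u00e9"        # every code point is below 2**16
astral = "a\U00010000"   # contains a code point >= 2**16

# Below 2**16 the code units are identical at both widths.
print(code_units(bmp, 2) == code_units(bmp, 4))    # True

# At or above 2**16 the 2-byte form uses a surrogate pair.
print(code_units(astral, 2))                       # [97, 55296, 56320]
print(code_units(astral, 4))                       # [97, 65536]

# A hash over the units therefore differs between the two widths.
print(toy_hash(code_units(astral, 2)) == toy_hash(code_units(astral, 4)))  # False
```

With these inputs the first string yields the same units, and so the same
toy hash, at either width; the second does not, because U+10000 becomes
the two units 0xD800 0xDC00 in the 2-byte form.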