Re: [Python-3000] Unicode and OS strings

Guido van Rossum Tue, 18 Sep 2007 14:26:21 -0700

On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
>
> > There's no UTF-8 in Python's internal string encoding.  What are you
> > talking about?
>
> (At least as of a few days ago)
>
> In Python 3 there is; strings are unicode.  A PyUnicodeObject object
> has two encodings that you can grab from a pointer (which means they
> have to be there; you don't have time to generate them like you would
> with a function pointer).


Incorrect. The pointer can be NULL. The API for getting the UTF-8
encoding is a function (moreover, a function whose name starts with
_Py, i.e. an internal one).
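
To illustrate the point about the pointer being allowed to be NULL, here is
a minimal Python sketch of a lazily filled encoding cache. It is only an
analogy, not the actual C implementation; the class name LazyText and the
method as_utf8 are hypothetical, and the attribute defenc merely borrows the
field name mentioned in the quoted text.

```python
class LazyText:
    """Hypothetical stand-in for a string object with a cached UTF-8 form."""

    def __init__(self, value):
        self.value = value
        # Analogous to a NULL pointer: no encoded form has been cached yet.
        self.defenc = None

    def as_utf8(self):
        """Return the UTF-8 bytes, computing and caching them on first use."""
        if self.defenc is None:
            self.defenc = self.value.encode("utf-8")
        return self.defenc


text = LazyText("h\u00e9llo")
print(text.defenc)      # nothing cached before the first call
print(text.as_utf8())   # computed on demand, then stored in text.defenc
```

Because callers go through a function rather than reading the field
directly, the cached bytes only need to exist once somebody has asked for
them.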

> One of these (str) is the "internal encoding" which is chosen at
> compile time, and the other (defenc) is now hard-coded to UTF-8.
>
> Hashing is also based on the UTF-8 bytestring.

Not any more as of a few hours ago; the hashing based on UTF-8 was
excessively expensive, and I rewrote it to directly use the code
units (the Py_UNICODE values). For strings containing no code points
>= 2**16 this will give the same value on all platforms; if there are
code points >= 2**16, results vary, since these will be represented as
surrogate pairs on 2-byte systems but not on 4-byte systems.
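
A small sketch of why the results can differ, using the UTF-16 and UTF-32
codecs to stand in for 2-byte and 4-byte Py_UNICODE representations. The
helper names code_units and toy_hash are hypothetical, and toy_hash is a
made-up multiplicative hash for illustration only, not the algorithm CPython
uses.

```python
def code_units(text, width):
    """Return the code units of `text` for a 2-byte or 4-byte representation."""
    codec = {2: "utf-16-le", 4: "utf-32-le"}[width]
    data = text.encode(codec)
    return [int.from_bytes(data[i:i + width], "little")
            for i in range(0, len(data), width)]


def toy_hash(units):
    """Illustrative 32-bit hash over code units; NOT CPython's real algorithm."""
    h = 0
    for unit in units:
        h = ((1000003 * h) ^ unit) & 0xFFFFFFFF
    return h


bmp = "caf\u00e9"        # every code point is below 2**16
astral = "a\U00010000"   # contains a code point >= 2**16

# Same code units on both widths, so the same hash.
print(toy_hash(code_units(bmp, 2)) == toy_hash(code_units(bmp, 4)))

# The 2-byte form holds a surrogate pair, the 4-byte form a single unit.
print([hex(u) for u in code_units(astral, 2)])
print([hex(u) for u in code_units(astral, 4)])
print(toy_hash(code_units(astral, 2)) == toy_hash(code_units(astral, 4)))
```

Any hash that walks the code units directly sees a different sequence for
the second string depending on the unit width, which is where the platform
dependence comes from.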

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
