On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> The str type itself can presently be used to process other
> character encodings: if they are fixed width < 32-bit elements those
> encodings might be considered Unicode encodings, but there is no requirement
> that they are, and some operations on str may operate with knowledge of some
> Unicode semantics, so there are caveats.

Actually, the str type in Python 3 and the unicode type in Python 2 are constrained everywhere to either 16-bit or 21-bit "characters". (Except when writing C code, which can do any number of invalid things so is the equivalent of assuming 1 == 0.) In particular, on a wide build, there is no way to get a code point >= 2**21, and I don't want PEP 393 to change this. So at best we can use these types to represent arrays of 21-bit unsigned ints. But I think it is more useful to think of them as always representing "some form of Unicode", whether that is UTF-16 (on narrow builds) or 21-bit code points or perhaps some vaguely similar superset -- but for those code units/code points that are representable *and* valid (either code points or code units) according to (the supported version of) the Unicode standard, the meaning of those code points/units matches that of the standard.

Note that this is different from the bytes type, where the meaning of a byte is entirely determined by what it means in the programmer's head.

-- 
--Guido van Rossum (python.org/~guido)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
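
To illustrate the two points in the reply above (the upper limit on str code points, and bytes having no inherent meaning), here is a minimal sketch. It assumes Python 3 on a wide build or version 3.3+; the helper names max_code_point_bits and try_chr are made up for this example and are not part of any API.

```python
import sys

def max_code_point_bits():
    # sys.maxunicode is the largest code point a str element can hold
    # (0x10FFFF on wide builds and on Python 3.3+, 0xFFFF on old narrow builds).
    return sys.maxunicode.bit_length()

def try_chr(n):
    # Hypothetical helper: return chr(n), or None if n is outside the
    # range that str can represent.
    try:
        return chr(n)
    except ValueError:
        return None

# A str cannot hold a code point beyond the Unicode range.
top = try_chr(0x10FFFF)       # accepted
too_big = try_chr(0x110000)   # rejected with ValueError, so None here

# A byte has no meaning of its own: the same value decodes to different
# characters depending on the encoding the programmer has in mind.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")
```

Here too_big ends up as None because chr() refuses anything past 0x10FFFF, while as_latin1 and as_cp437 differ even though they come from the same byte.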