On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
> What I'm trying to say is that it is possible to use UTF-16 internally,
> but *not* assume that every code point (character) is represented by a
> single 2-byte unit. For example, the len() of a UTF-16 string should not
> be calculated by counting the number of bytes and dividing by two. You
> actually need to walk the string, inspecting each double-byte

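To make the quoted point concrete, here is a minimal Python sketch (not from the original post) that counts code points in UTF-16 data by walking the 16-bit code units and treating a high/low surrogate pair as a single character. The function name utf16_len is hypothetical, and the sketch assumes little-endian input with no BOM.

```python
import struct

def utf16_len(data: bytes) -> int:
    """Count code points in UTF-16-LE bytes (hypothetical helper, no BOM assumed)."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    # Decode the raw bytes into a tuple of 16-bit code units.
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate is one code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count

encoded = "a\U0001F600b".encode("utf-16-le")
print(len(encoded) // 2)   # 4 code units (bytes divided by two)
print(utf16_len(encoded))  # 3 code points
```

Dividing the byte count by two gives the number of code units, which overcounts whenever the string contains a character outside the Basic Multilingual Plane. An unpaired surrogate is counted as one code point here, which is an arbitrary choice for the sketch.
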
Anything's possible. But since the underlying representation can be changed fairly easily (a relative term, of course: it's a lot of work, but it can be done in a single release, with no deprecation period required), there's very little reason to keep using UTF-16 underneath. May as well switch to UTF-32 for convenience, or to PEP 393 for convenience and efficiency, or maybe to some other system that's still mostly fixed-width.

ChrisA
--
http://mail.python.org/mailman/listinfo/python-list