Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
> using two code points. This is fragile and doesn't work very well,
> because string-handling methods can break the surrogate pairs apart,
> leaving you with invalid unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient
> format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little more? Switching between encodings based on the string contents seems silly at first glance. Strings are immutable, so I don't understand why we can't use UTF-8 or UTF-16 for everything. UTF-8 is more efficient for Latin-based alphabets, and UTF-16 may be more efficient for some other languages. I think even UCS-4 doesn't completely fix the surrogate pair issue, if it means the only thing I can think of.
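
To make the question concrete, here is a minimal sketch of what I assume "breaking surrogate pairs apart" refers to: slicing a UTF-16 sequence between the two 16-bit code units that encode one non-BMP character. The helper names to_utf16_units and is_valid_utf16 are made up for illustration, and the code simulates UTF-16 storage explicitly rather than relying on how any particular Python build stores strings.

```python
import struct

def to_utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_valid_utf16(units):
    """True if the code units decode cleanly as UTF-16 (hypothetical helper)."""
    data = struct.pack("<%dH" % len(units), *units)
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        # A lone high or low surrogate is not valid UTF-16.
        return False
    return True

# 'a', U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), 'b'
text = "a\U0001D11Eb"
units = to_utf16_units(text)

# Three characters, but four code units: the clef needs a surrogate pair.
print([hex(u) for u in units])   # ['0x61', '0xd834', '0xdd1e', '0x62']

# Slicing by code unit index can land in the middle of the pair.
head, tail = units[:2], units[2:]
print(is_valid_utf16(units))     # True
print(is_valid_utf16(head))      # False: ends with a lone high surrogate
print(is_valid_utf16(tail))      # False: starts with a lone low surrogate
```

If this is the problem being described, then any operation that indexes or slices by code unit rather than by code point can produce it.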