On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull <step...@xemacs.org> wrote:
..
> > I note that an opinion has been raised on this thread that
> > if we want compressed internal representation for strings, we should
> > use UTF-8. I tend to agree, but UTF-8 has been repeatedly rejected as
> > too hard to implement. What makes UTF-16 easier than UTF-8? Only the
> > fact that you can ignore bugs longer, in my view.
>
> That's mostly true. My guess is that we can probably ignore those
> bugs for as long as it takes someone to write the higher-level
> libraries that James suggests and MAL has actually proposed and
> started a PEP for.
>
As far as I can tell, that PEP has generated a grand total of one comment in nine years. This may or may not be indicative of how far away we are from seeing it implemented. :-)

As for the UTF-8 vs. UCS-2/4 debate, I have an idea that may be even more far-fetched. Once upon a time, Python Unicode strings supported the buffer protocol and would lazily fill an internal buffer with bytes in the default encoding. In 3.x the default encoding has been fixed as UTF-8 and buffer protocol support was removed from strings, but the internal buffer caching the (now UTF-8) encoded representation remained. Maybe we can now implement the defenc logic in reverse.

Recall that strings are stored as UCS-2/4 sequences, but once a buffer is requested in 2.x Python code, or a char* is obtained via _PyUnicode_AsStringAndSize() at the C level in 3.x, an internal buffer is filled with UTF-8 bytes and defenc is set to point to that buffer.

So the idea is for strings to store their data upon construction as a UTF-8 buffer pointed to by defenc. If an application uses string indexing, UTF-8-only strings will lazily fill their UCS-2/4 buffer. Proper Unicode-aware algorithms, such as grapheme, word or line iteration, and simple operations, such as concatenation, search or substitution, would operate directly on defenc buffers.

Presumably, over time fewer and fewer applications would use code unit indexing, which requires the UCS-2/4 buffer, and eventually Python strings could stop supporting indexing altogether, just like they stopped supporting the buffer protocol in 3.x.
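
To make the "defenc in reverse" idea concrete, here is a toy model in pure Python. It is only a sketch, not CPython code: the class name Utf8FirstStr, its attributes and the indexed property are all made up for illustration, and a list of code points stands in for the UCS-2/4 buffer. Concatenation and substring search touch only the UTF-8 bytes; the code point array is built the first time something indexes the string.

```python
class Utf8FirstStr:
    """Toy model: UTF-8 bytes are primary, code points are filled lazily.

    Hypothetical names throughout; this does not mirror CPython's structs.
    """

    def __init__(self, text):
        self._utf8 = text.encode("utf-8")  # plays the role of the defenc buffer
        self._code_points = None           # plays the role of the UCS-2/4 buffer

    @classmethod
    def _from_utf8(cls, data):
        # Build a new string directly from already-encoded bytes.
        obj = cls.__new__(cls)
        obj._utf8 = data
        obj._code_points = None
        return obj

    def __add__(self, other):
        # Concatenating two valid UTF-8 buffers yields valid UTF-8.
        return self._from_utf8(self._utf8 + other._utf8)

    def __contains__(self, other):
        # UTF-8 is self-synchronizing, so a byte-level match of a valid
        # needle always falls on character boundaries.
        return other._utf8 in self._utf8

    def __getitem__(self, index):
        # Indexing is the only operation that needs the wide buffer.
        if self._code_points is None:
            self._code_points = [ord(c) for c in self._utf8.decode("utf-8")]
        return chr(self._code_points[index])

    @property
    def indexed(self):
        # True once the lazy code point buffer has been filled.
        return self._code_points is not None

    def __str__(self):
        return self._utf8.decode("utf-8")


s = Utf8FirstStr("naïve ") + Utf8FirstStr("café")
found = Utf8FirstStr("café") in s   # byte-level search, no decoding
filled_before = s.indexed           # wide buffer not built yet
third = s[2]                        # first indexing fills the buffer
filled_after = s.indexed
```

In this model an application that never indexes pays only for the UTF-8 bytes, which is the point of the proposal; the cost of the wide buffer shows up only for code that asks for it.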