M.-A. Lemburg wrote:
Note that using UTF-8 as internal storage format would not work in Python, since Python is a Unicode producer, i.e. it needs to be able to generate and work with code points that are not allowed in UTF-8, e.g. lone surrogates.
Well, it wouldn't strictly be UTF-8, any more than the 2-byte build is strictly UTF-16, in the sense that lone surrogates can be produced.
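To make the lone-surrogate point concrete, here is a minimal sketch, assuming a Python 3 interpreter (the "surrogatepass" error handler is not available in the 2.x codecs). It shows that a strict UTF-8 encoder rejects a lone surrogate, while a relaxed one can still store it in three bytes, which is the kind of "not strictly UTF-8" storage I mean:

```python
# A lone surrogate is a legal code point in a Python str,
# but a strict UTF-8 encoder refuses to encode it.
lone = "\ud800"

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler emits the generalized 3-byte form
# instead of raising, and can decode it back again.
relaxed = lone.encode("utf-8", "surrogatepass")
round_trip = relaxed.decode("utf-8", "surrogatepass")

print(strict_ok, relaxed, round_trip == lone)
```

The variable names are only for illustration; nothing here says how an actual 1-byte build would store such code points internally.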
Another reason not to use UTF-8 encoded code units is that slicing based on code units could easily create invalid UTF-8 which would then render the data unusable. This is a lot less likely to happen with UCS2 or UCS4.
The use cases I had in mind for a 1-byte build are those for which the alternative would be keeping everything in bytes. Applications using a 1-byte build would need to be aware of the fact and take care to slice strings at valid places. If they were using bytes, they would have to face exactly the same issues.
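As a rough illustration of what "slicing at valid places" involves, the sketch below works on plain bytes in Python 3. The helper name safe_cut is hypothetical, not an existing API. It backs up from a requested cut position until it no longer lands on a UTF-8 continuation byte (bit pattern 10xxxxxx):

```python
def safe_cut(data: bytes, n: int) -> int:
    """Return the largest index <= n that does not split a UTF-8 sequence.

    Hypothetical helper; assumes `data` is well-formed UTF-8.
    """
    n = max(0, min(n, len(data)))
    # Continuation bytes look like 10xxxxxx; a cut must not land on one.
    while 0 < n < len(data) and (data[n] & 0xC0) == 0x80:
        n -= 1
    return n


text = "naïve".encode("utf-8")  # b'na\xc3\xafve'

# A cut at byte 3 falls in the middle of the 2-byte sequence for "ï".
naive_slice = text[:3]
try:
    naive_slice.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# Backing up to the previous boundary keeps the slice decodable.
safe_slice = text[:safe_cut(text, 3)]
print(naive_ok, safe_slice.decode("utf-8"))
```

An application keeping its data in bytes would need the same kind of check, which is the point being made above.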
And finally: RAM is cheap and today's CPUs work better with 16- or 32-bit values than 8-bit characters.
Yet some people have reported significant performance benefits for some applications from using a 2-byte build instead of a 4-byte build. I was just speculating whether a 1-byte build might be of further advantage in a few specialised cases. No matter how much RAM or processing speed you have, it's always possible to find an application that stresses the limits.

--
Greg
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev