Re: [Python-Dev] thoughts on the bytes/string discussion

Greg Ewing Wed, 07 Jul 2010 03:49:01 -0700

M.-A. Lemburg wrote:

> Note that using UTF-8 as internal storage format would not work
> in Python, since Python is a Unicode producer, i.e. it needs to
> be able to generate and work with code points that are not allowed
> in UTF-8, e.g. lone surrogates.


Well, it wouldn't strictly be UTF-8, any more than the
2-byte build is strictly UTF-16, in the sense that lone
surrogates can be produced.
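
As an illustration of the lone-surrogate point, here is a minimal sketch in Python 3. It assumes the standard "surrogatepass" error handler of the built-in codecs, which is not mentioned in this thread; it only shows that a strict UTF-8 codec rejects a lone surrogate while a relaxed, "not strictly UTF-8" form can still carry it.

```python
# A lone high surrogate is a legal element of a Python str.
lone = "\ud800"

# Strict UTF-8 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the "surrogatepass" error handler the code point is written
# out using the ordinary 3-byte pattern, giving bytes that are not
# well-formed UTF-8 but that round-trip.
relaxed = lone.encode("utf-8", "surrogatepass")

# A strict decoder rejects those bytes...
try:
    relaxed.decode("utf-8")
    roundtrip_strict = True
except UnicodeDecodeError:
    roundtrip_strict = False

# ...while the relaxed decoder restores the original string.
restored = relaxed.decode("utf-8", "surrogatepass")
```

Whether a hypothetical 1-byte build would use exactly this relaxed form is an open design choice; the sketch only shows that such a form exists.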

> Another reason not to use UTF-8 encoded code units is that slicing
> based on code units could easily create invalid UTF-8 which would
> then render the data unusable. This is a lot less likely to happen
> with UCS2 or UCS4.


The use cases I had in mind for a 1-byte build are those for
which the alternative would be keeping everything in bytes.
Applications using a 1-byte build would need to be aware of
the fact and take care to slice strings at valid places. If
they were using bytes, they would have to face exactly the
same issues.
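
To make "slicing at valid places" concrete, here is a rough sketch using bytes, which is the alternative discussed above. The helper names is_boundary and safe_cut are hypothetical, invented for this example; the only fact relied on is that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
def is_boundary(data: bytes, i: int) -> bool:
    """Return True if index i does not fall inside a multi-byte sequence."""
    if i <= 0 or i >= len(data):
        return True
    # Continuation bytes look like 0b10xxxxxx; anything else starts a character.
    return (data[i] & 0xC0) != 0x80


def safe_cut(data: bytes, i: int) -> int:
    """Move i backwards to the nearest character boundary."""
    while not is_boundary(data, i):
        i -= 1
    return i


data = "naïve café".encode("utf-8")

# Cutting at index 3 splits the two-byte encoding of "ï".
try:
    data[:3].decode("utf-8")
    naive_cut_ok = True
except UnicodeDecodeError:
    naive_cut_ok = False

# Adjusting the index first gives a slice that decodes cleanly.
cut = safe_cut(data, 3)
head = data[:cut].decode("utf-8")
```

The same care would presumably be needed with a 1-byte string build, since the slice indices would count code units in both cases.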

> And finally: RAM is cheap and today's CPUs work better with 16- or
> 32-bit values than 8-bit characters.


Yet some people have reported significant performance benefits
for some applications from using a 2-byte build instead of a
4-byte build. I was just speculating whether a 1-byte build
might be of further advantage in a few specialised cases.
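
A rough way to see the storage side of this trade-off is to compare encoded payload sizes. The sketch below uses UTF-8, UTF-16 and UTF-32 as stand-ins for 1-, 2- and 4-byte code units; that mapping is an assumption made for illustration, and the numbers say nothing about speed. The function name storage_sizes is hypothetical.

```python
def storage_sizes(text: str) -> dict:
    """Payload size in bytes of text under 1-, 2- and 4-byte code units."""
    return {
        "1-byte": len(text.encode("utf-8")),
        "2-byte": len(text.encode("utf-16-le")),  # -le avoids counting a BOM
        "4-byte": len(text.encode("utf-32-le")),
    }


# Mostly-ASCII data is where 1-byte code units save the most space.
ascii_sizes = storage_sizes("hello world")

# For text outside the Latin range the advantage can reverse:
# each of these characters needs 3 bytes in UTF-8 but 2 in UTF-16.
cjk_sizes = storage_sizes("日本語")
```

So any benefit would be limited to the kind of specialised, ASCII-heavy workloads mentioned above.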

No matter how much RAM or processing speed you have, it's always
possible to find an application that stresses the limits.

--
Greg

