Re: [Python-Dev] len(chr(i)) = 2?

James Y Knight Tue, 23 Nov 2010 22:30:28 -0800

On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:
> By the way, to send the ball back into your court, I have this feeling
> that the demand for UTF-8 is once again driven by native English
> speakers who are very shortly going to find themselves, and the data
> they are most familiar with, very much in the minority.  Of course the
> market that benefits from UTF-8 compression will remain very large for
> the immediate future, but in the grand scheme of things, most of the
> world is going to prefer UTF-16 by a substantial margin.


No, the demand for UTF-8 is because that's what much of the internet (and, not 
coincidentally, the Unix) world has standardized on. The main pieces of software 
using UTF-16 (Windows, Java) started doing so before it became apparent that 16 
bits wasn't enough to hold a Unicode codepoint, so they were actually 
implementing UCS-2. In those days, UCS-2 was a fairly sensible choice.

But now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly superior. Not 
because it's smaller -- it's pretty much a tossup -- but because it is an ASCII 
superset, and thus more easily compatible with other software. That also makes 
it the most commonly used encoding for internet communication. (So there's a 
huge advantage to using it internally as well: no transcoding is necessary when 
writing your HTML output.) UTF-16 is incompatible with ASCII, and furthermore, 
it's still a variable-width encoding, with all the issues that causes. As 
such, there's really very little to be said in favor of it.
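
To make the ASCII-superset and variable-width points concrete, here is a 
minimal sketch using only Python 3's built-in codecs. The sample strings and 
variable names are my own illustrations, not anything from this thread.

```python
# Assumption: Python 3 with its built-in codecs; sample text is illustrative.
ascii_text = "<p>hello</p>"

# UTF-8 encodes pure-ASCII text to exactly the bytes ASCII would produce,
# so ASCII-only tools can read it unchanged.
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
same_as_ascii = utf8_bytes == ascii_bytes

# UTF-16 (little-endian, no BOM) puts a zero byte after each ASCII
# character, so the result is not valid ASCII text.
utf16_bytes = ascii_text.encode("utf-16-le")

# UTF-16 is still variable-width: a codepoint above U+FFFF needs two
# 16-bit code units (a surrogate pair), while U+00E9 needs only one.
bmp_units = len("\u00e9".encode("utf-16-le")) // 2
astral_units = len("\U0001F600".encode("utf-16-le")) // 2

print(same_as_ascii, utf16_bytes[:4], bmp_units, astral_units)
```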

If you really want a fixed-width encoding, you have to go to UTF-32, which is 
excessively large. UTF-32 is a losing choice, simply because of the wasted 
memory usage.
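
A rough way to compare the three encodings on size is to count encoded bytes 
for a couple of short strings. This is only a sketch: the helper name 
encoded_sizes and the two sample strings are hypothetical, and real text 
(markup mixed with prose, for instance) will give different ratios.

```python
def encoded_sizes(text):
    """Return the encoded size in bytes of text under each encoding.

    The "-le" codec variants are used so that no byte-order mark
    is counted in the totals.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Illustrative samples only: five ASCII letters and three CJK characters.
english = encoded_sizes("hello")
japanese = encoded_sizes("\u65e5\u672c\u8a9e")

print(english)
print(japanese)
```

UTF-8 is smaller for the ASCII sample and UTF-16 is smaller for the CJK 
sample, while UTF-32 is the largest in both cases.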

But that's all a side issue: even if you do choose UTF-16 as your underlying 
encoding, you *still* need to provide iterators that work by "byte" (only now 
the "bytes" are 16-bit code units), by codepoint, and by grapheme. Of course, 
people who implement UTF-16 (such as Python, Java, and Windows) often pretend 
they're still implementing UCS-2, and don't even bother providing their users 
with the necessary APIs to do things correctly. You can often get away with 
that, just so long as you don't mind that you sometimes end up splitting a 
string in the middle of a codepoint and causing a Unicode error!
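
Here is a small sketch of the difference between iterating by 16-bit code 
unit and by codepoint, and of what happens when a split lands inside a 
surrogate pair. It assumes Python 3.3 or later, where iterating a str yields 
whole codepoints; the helper names are hypothetical. Grapheme iteration is 
left out because it needs Unicode segmentation rules that this sketch does 
not implement.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def codepoints(text):
    """Return the codepoints of text (assumes str iterates by codepoint)."""
    return [ord(ch) for ch in text]

# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies above U+FFFF.
text = "a\U0001D11E"

units = utf16_code_units(text)    # three code units
points = codepoints(text)         # two codepoints

# Cutting the UTF-16 data after two code units keeps 'a' plus only the
# high surrogate, so a strict decode of the fragment fails.
data = text.encode("utf-16-le")
truncated = data[:4]
try:
    truncated.decode("utf-16-le")
    split_error = False
except UnicodeDecodeError:
    split_error = True

print([hex(u) for u in units], [hex(p) for p in points], split_error)
```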

James
