James Y Knight writes:

 > But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly
 > superior [...] because it is an ASCII superset, and thus more
 > easily compatible with other software. That also makes it most
 > commonly used for internet communication.

Sure, UTF-8 is very nice as a protocol for communicating text.  So
what?  If your application involves shoveling octets real fast, don't
convert and shovel those octets.  If your application involves
significant text processing, well, conversion can almost always be
done as fast as you can do I/O so it doesn't cost wallclock time, and
generally doesn't require a huge percentage of CPU time compared to
the actual text processing.  It's just a specialization of
serialization, one we do all the time for more complex data
structures.
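To make that concrete, here is a minimal sketch (my own example, not
from the post) of conversion folded into the I/O boundary: the bytes
are decoded as they're read, so the text-processing core only ever
sees the internal representation.

```python
# Sketch: transcoding happens once, at the I/O boundary.  The
# text-processing code after the read never touches raw octets.
import io

raw = "naïve café".encode("utf-8")   # octets "on the wire"

# Decoding is folded into the read -- no separate conversion pass.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as f:
    text = f.read()                  # internal representation: str

assert text == "naïve café"
```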

So wire protocols are not a killer argument for or against any
particular internal representation of text.

 > (So, there's a huge advantage for using it internally as well right
 > there: no transcoding necessary for writing your HTML output).

I don't know your use cases but for mine, transcoding (whether in Lisp
or Python or C) is invariably the least of my worries.  *Especially*
transcoding to UTF-8, which is the default codec for me, and I *never*
mix bytes and text, so having not bothered to set the codec, I don't
bother to transcode explicitly.

 > If you really want a fixed-width encoding, you have to go to
 > UTF-32

Not really.  I never bothered implementing the codec, because I
haven't yet seen a non-BMP Unicode character in the wild (I still see
a lot of non-Unicode characters, but hey, that's the price you pay for
living in the land that invented sushi, sake, and anime).  For most
use cases, those are going to be rare, where by "rare" I mean "you
aren't going to see 6400 *different* non-BMP characters."[1]  So
instead of having the codec produce UTF-16, you have it produce (Holy
CEF, Batman!) "pure" UCS-2 with the non-BMP characters registered on
demand and encoded in the BMP private area.  Python, of course, will
never know the difference, and your language won't need to care, either.
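A rough sketch of that scheme, with names of my own invention: on
decode, any non-BMP codepoint gets a BMP Private Use Area slot
(U+E000..U+F8FF, exactly 6400 slots) registered on demand, so the
internal representation stays fixed-width UCS-2; the mapping is
reversed on the way back out.

```python
# Hypothetical on-demand PUA registry for the "pure UCS-2" codec
# described above.  All names here are illustrative, not an API.
PUA_START, PUA_END = 0xE000, 0xF8FF   # 6400 BMP private-use slots

class PUARegistry:
    def __init__(self):
        self._to_pua = {}      # real codepoint -> PUA codepoint
        self._from_pua = {}    # PUA codepoint  -> real codepoint
        self._next = PUA_START

    def encode_char(self, cp):
        """Map a non-BMP codepoint into the BMP private area."""
        if cp <= 0xFFFF:
            return cp          # BMP characters pass through untouched
        if cp not in self._to_pua:
            if self._next > PUA_END:
                raise OverflowError("more than 6400 distinct non-BMP chars")
            self._to_pua[cp] = self._next
            self._from_pua[self._next] = cp
            self._next += 1
        return self._to_pua[cp]

    def decode_char(self, cp):
        """Recover the original codepoint on re-encoding."""
        return self._from_pua.get(cp, cp)

reg = PUARegistry()
text = "x\U0001D49C y"   # contains a non-BMP character (U+1D49C)
internal = [reg.encode_char(ord(c)) for c in text]
assert all(cp <= 0xFFFF for cp in internal)   # fixed-width UCS-2 inside
restored = "".join(chr(reg.decode_char(cp)) for cp in internal)
assert restored == text
```

The point is that everything downstream of the codec sees only BMP
codepoints, so indexing stays O(1) without going to UTF-32.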

 > But that's all a side issue: even if you do choose UTF-16 as your
 > underlying encoding, you *still* need to provide iterators that
 > work by "byte" (only now bytes are 16-bits), by codepoint,

Nope, see above.  Codepoints can be bytes and vice versa.  The needed
codec is no harder to use than any other codec, and only slightly less
efficient than the normal UTF-8 codec unless you're basically
restricted to a rather uncommon script (and even then there are
optimizations).

 > and by grapheme.

Sure, but as I point out elsewhere, the use cases I can come up with
where grapheme movement is distinguished from character movement are
all iterative, and I don't need array behavior for both anyway.
So since I *can* have a character array in Unicode, and I *can't* have
a grapheme array (except maybe by a scheme like the above), I'll go
for the character array.
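The iterative-vs-array distinction can be sketched like this (my
example; the grapheme rule here just groups a base character with its
trailing combining marks, a simplification of full UAX #29
segmentation, but enough to show the point):

```python
# Characters index like an array; graphemes only iterate.
import unicodedata

def graphemes(s):
    """Yield simplified grapheme clusters: base char + combining marks."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster       # a new base character starts a cluster
            cluster = ch
        else:
            cluster += ch       # combining mark joins the current cluster
    if cluster:
        yield cluster

s = "e\u0301a"   # 'e' + COMBINING ACUTE ACCENT, then 'a'
assert len(s) == 3                                # 3 codepoints, O(1) indexable
assert list(graphemes(s)) == ["e\u0301", "a"]     # 2 graphemes, iteration only
```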

Unless maybe you convince me I don't need it, but I've yet to be
convinced.

 > away with...just so long as you don't mind that you sometimes end
 > up splitting a string in the middle of a codepoint and causing a
 > unicode error!

I *do* mind, but I like Python anyway.<wink>


Footnotes: 
[1]  OK, in practice a lot of the private space will be taken by
existing system characters, such as the Apple logo (absolutely
essential for writing email on Mac, at least in Japan).  Whose
use-case is going to see 1000 different non-BMP characters in a
session?  I do know a couple of Buddhist dictionary editors, but
aside from them, I can't think of anybody.  Lara Croft, maybe.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev