James Y Knight writes:

 > But, now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly
 > superior [...] because it is an ASCII superset, and thus more
 > easily compatible with other software. That also makes it most
 > commonly used for internet communication.
Sure, UTF-8 is very nice as a protocol for communicating text. So what? If your application involves shoveling octets real fast, don't convert -- just shovel those octets. If your application involves significant text processing, conversion can almost always be done as fast as you can do I/O, so it doesn't cost wallclock time, and it generally doesn't require a large percentage of CPU time compared to the actual text processing. It's just a specialization of serialization, which we do all the time for more complex data structures. So wire protocols are not a killer argument for or against any particular internal representation of text.

 > (So, there's a huge advantage for using it internally as well right
 > there: no transcoding necessary for writing your HTML output).

I don't know your use cases, but for mine, transcoding (whether in Lisp or Python or C) is invariably the least of my worries. *Especially* transcoding to UTF-8, which is the default codec for me; and since I *never* mix bytes and text, having not bothered to set the codec, I don't bother to transcode explicitly either.

 > If you really want a fixed-width encoding, you have to go to
 > UTF-32

Not really. I never bothered implementing the codec, because I haven't yet seen a non-BMP Unicode character in the wild (I still see a lot of non-Unicode characters, but hey, that's the price you pay for living in the land that invented sushi, sake, and anime). For most use cases, non-BMP characters are going to be rare, where by "rare" I mean "you aren't going to see 6400 *different* non-BMP characters."[1] So instead of having the codec produce UTF-16, you have it produce (Holy CEF, Batman!) "pure" UCS-2, with the non-BMP characters registered on demand and encoded in the BMP private use area. Python, of course, will never know the difference, and your language won't need to care, either.
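To make the register-on-demand idea concrete, here is a minimal sketch of folding non-BMP characters into the BMP private use area (U+E000..U+F8FF, exactly 6400 slots). All names here (`PUARegistry`, `fold`, `unfold`) are hypothetical, invented for illustration; this is not any real codec's API:

```python
# Sketch: map rare non-BMP codepoints to BMP private-use stand-ins,
# so the working representation stays fixed-width "pure" UCS-2.

PUA_START, PUA_END = 0xE000, 0xF8FF  # 6400 private-use slots in the BMP

class PUARegistry:
    def __init__(self):
        self._to_pua = {}     # real non-BMP codepoint -> PUA codepoint
        self._from_pua = {}   # PUA codepoint -> real non-BMP codepoint
        self._next = PUA_START

    def fold(self, text):
        """Replace non-BMP characters with PUA stand-ins, registering
        each distinct character on first sight."""
        out = []
        for ch in text:
            cp = ord(ch)
            if cp > 0xFFFF:
                if cp not in self._to_pua:
                    if self._next > PUA_END:
                        raise OverflowError("private use area exhausted")
                    self._to_pua[cp] = self._next
                    self._from_pua[self._next] = cp
                    self._next += 1
                out.append(chr(self._to_pua[cp]))
            else:
                out.append(ch)
        return "".join(out)

    def unfold(self, text):
        """Restore the original non-BMP characters for output."""
        return "".join(
            chr(self._from_pua.get(ord(ch), ord(ch))) for ch in text
        )

reg = PUARegistry()
folded = reg.fold("clef: \U0001D11E")  # U+1D11E MUSICAL SYMBOL G CLEF
assert all(ord(c) <= 0xFFFF for c in folded)   # fixed-width, BMP-only
assert reg.unfold(folded) == "clef: \U0001D11E"
```

The point of the sketch is that folding and unfolding happen only at the codec boundary; everything in between sees one code unit per character.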
 > But that's all a side issue: even if you do choose UTF-16 as your
 > underlying encoding, you *still* need to provide iterators that
 > work by "byte" (only now bytes are 16-bits), by codepoint,

Nope, see above. Codepoints can be "bytes" and vice versa. The needed codec is no harder to use than any other codec, and only slightly less efficient than the normal UTF-8 codec unless you're basically restricted to a rather uncommon script (and even then there are optimizations).

 > and by grapheme.

Sure, but as I point out elsewhere, the use cases I can come up with where grapheme movement is distinguished from character movement are all iterative, and I don't need array behavior for both anyway. So since I *can* have a character array in Unicode, and I *can't* have a grapheme array (except maybe by a scheme like the above), I'll go for the character array. Unless maybe you convince me I don't need it, but I have yet to be convinced.

 > away with...just so long as you don't mind that you sometimes end
 > up splitting a string in the middle of a codepoint and causing a
 > unicode error!

I *do* mind, but I like Python anyway.<wink>

Footnotes:
[1] OK, in practice a lot of the private space will be taken by existing system characters, such as the Apple logo (absolutely essential for writing email on a Mac, at least in Japan). Whose use case is going to see 1000 different non-BMP characters in a session? I do know a couple of Buddhist dictionary editors, but aside from them, I can't think of anybody. Lara Croft, maybe.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
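P.S. The codepoint-vs-grapheme distinction above can be sketched in a few lines. This is a deliberate simplification, not real grapheme segmentation: it just attaches combining marks to the preceding base character, whereas full grapheme clusters are defined by Unicode UAX #29 (Hangul jamo, ZWJ emoji sequences, and so on). The function name `graphemes` is made up for the example:

```python
# Approximate grapheme iteration: a base character plus any following
# combining marks counts as one cluster. Codepoint iteration is just
# ordinary iteration over the string.
import unicodedata

def graphemes(text):
    cluster = ""
    for ch in text:
        # combining class 0 means "starts a new cluster" in this
        # simplified model
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster

s = "Stu\u0308rmer"            # 'u' followed by COMBINING DIAERESIS
codepoints = list(s)           # 8 codepoints
clusters = list(graphemes(s))  # 7 clusters: the accent rides with its 'u'
assert len(codepoints) == 8
assert len(clusters) == 7
assert clusters[2] == "u\u0308"
```

Note that the codepoint view supports O(1) indexing while the grapheme view is inherently iterative, which is exactly the asymmetry argued above.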