For some reason this posting by H. S. Teoh shows up on the mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
> On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
>> The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
> I'll have you know that Chinese, Korean, and Japanese account for a significant percentage of the world's population, and therefore arguments about "vast majority" are kinda missing the forest for the trees. If you count the number of *alphabets* that can be encoded in a single byte, you can get a majority, but that in no way reflects actual usage.

Not just "a majority": the vast majority of alphabets, representing 85% of the world's population.
>>> The only alternatives to a variable width encoding I can see are:
>>> - Single code page per string
>>> This is completely useless because now you can't concatenate strings of different code pages.
>> I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
> This is so patently absurd I don't even know how to begin to answer... have you actually dealt with any significant amount of text at all? A large amount of text in today's digital world is at least bilingual, if not more. Even in pure English text, you occasionally need a foreign letter in order to transcribe a borrowed/quoted word, e.g., "cliché", "naïve", etc. Under your scheme, it would be impossible to encode any text that contains even a single instance of such words. All it takes is *one* word in a 500-page text and your scheme breaks down, and we're back to the bad ole days of codepages. And yes you can say "well just include é and ï in the English code page". But then all it takes is a single math formula that requires a Greek letter, and your text is non-encodable anymore. By the time you pull in all the French, German, Greek letters and math symbols, you might as well just go back to UTF-8.
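[For what it's worth, Teoh's point here is easy to demonstrate; a minimal Python sketch (mine, not from the thread) showing that a lone accented letter costs only one extra byte in UTF-8 rather than forcing a codepage switch:]

```python
# In UTF-8, ASCII characters take one byte each; an accented Latin letter
# like é (U+00E9) or ï (U+00EF) takes two. A borrowed word in otherwise
# pure English text therefore costs one extra byte per accent, nothing more.
for word in ["cliche", "cliché", "naïve"]:
    encoded = word.encode("utf-8")
    print(f"{word}: {len(word)} chars, {len(encoded)} bytes")
```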
I think you misunderstand what this implies. I mentioned it earlier as another possibility to Walter: "keep all your strings in a single language, with a different format to compose them together." Nobody is talking about disallowing alphabets other than English or going back to code pages. The fundamental question is whether it makes sense to combine all these different alphabets and their idiosyncratic rules into a single string and encoding.

There is a good argument to be made that the differences outweigh the similarities and you'd be better off keeping each language/alphabet in its own string. It's a question of modeling, just like a class hierarchy. As I said, I'm not sure this is the best route, but it has some real strengths.
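[A hypothetical sketch of what "single-language strings with a different format to compose them" might look like; every name below is my invention for illustration, not a design anyone in the thread actually specified:]

```python
from dataclasses import dataclass

# Hypothetical model: a "run" is text entirely in one language/alphabet,
# and a document is a sequence of runs instead of one mixed-alphabet string.
@dataclass
class Run:
    lang: str   # e.g. "en" for English, "el" for Greek
    text: str

doc = [
    Run("en", "The area of a circle is "),
    Run("el", "π"),
    Run("en", "r squared."),
]

# Composition concatenates the runs for display, while per-language
# processing (collation, casing, line breaking) can handle each run
# under its own alphabet's rules.
rendered = "".join(run.text for run in doc)
print(rendered)
```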
> The alternative is to have embedded escape sequences for the rare foreign letter/word that you might need, but then you're back to being unable to slice the string at will, since slicing it at the wrong place will produce gibberish.

No one has presented this as a viable option.

> I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are things about it that are annoying, but it's certainly better than the scheme you're proposing.

I disagree.
On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
> And just how exactly does that help with slicing? If anything, it makes slicing way hairier and error-prone than UTF-8. In fact, this one point alone already defeated any performance gains you may have had with a single-byte encoding. Now you can't do *any* slicing at all without convoluted algorithms to determine what encoding is where at the endpoints of your slice, and the resulting slice must have new headers to indicate the start/end of every different-language substring. By the time you're done with all that, you're going way slower than processing UTF-8.
There are no convoluted algorithms, it's a simple check of whether the string contains any two-byte encodings, a check which can be done once and cached. If it's single-byte all the way through, no problems whatsoever with slicing. If there are two-byte languages included, the slice function will have to do a little arithmetic calculation before slicing. You will also need a few arithmetic ops to create the new header for the slice. The point is that these operations will be much faster than decoding every code point to slice UTF-8.
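[My reading of that scheme, sketched in Python with invented details; the header layout and all names are illustrative assumptions, not a concrete proposal from the thread:]

```python
from dataclasses import dataclass

@dataclass
class Segment:
    lang: str   # language tag for this run of text
    chars: int  # number of characters in the segment
    width: int  # bytes per character: 1 or 2, fixed within a segment

@dataclass
class TaggedString:
    header: list  # list of Segment: the per-language "header"
    data: bytes   # payload: fixed-width characters, segment by segment

    @property
    def single_byte_only(self):
        # The check that can be done once and cached per string.
        return all(seg.width == 1 for seg in self.header)

    def char_to_byte(self, index):
        # A little header arithmetic instead of per-code-point decoding.
        byte = 0
        for seg in self.header:
            if index <= seg.chars:
                return byte + index * seg.width
            index -= seg.chars
            byte += seg.chars * seg.width
        return byte

    def slice_bytes(self, start, stop):
        # start/stop are character indices.
        if self.single_byte_only:
            return self.data[start:stop]  # chars == bytes: plain slice
        return self.data[self.char_to_byte(start):self.char_to_byte(stop)]
```

Building the new header for the slice would be a similar pass over the segments; I've omitted it to keep the sketch short.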
> Again I say, I'm not 100% sold on UTF-8, but what you're proposing here is far worse.

Well, I'm glad you realize some problems with UTF-8 :) even if you dismiss my alternative out of hand.