For some reason this posting by H. S. Teoh shows up on the mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
The vast majority of non-english alphabets in UCS can be encoded in
a single byte.  It is your exceptions that are not relevant.

I'll have you know that Chinese, Korean, and Japanese account for a significant percentage of the world's population, and therefore arguments about "vast majority" are kinda missing the forest for the trees. If you count the number of *alphabets* that can be encoded in a single byte, you can get a majority, but that in no way reflects actual usage.
Not just "a majority," the vast majority of alphabets, representing 85% of the world's population.

> The only alternatives to a variable width encoding I can see are:
> - Single code page per string
> This is completely useless because now you can't concatenate strings of different code pages.
I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.

This is so patently absurd I don't even know how to begin to answer... have you actually dealt with any significant amount of text at all? A large amount of text in today's digital world is at least bilingual, if not more. Even in pure English text, you occasionally need a foreign letter to transcribe a borrowed/quoted word, e.g., "cliché", "naïve", etc. Under your scheme, it would be impossible to encode any text that contains even a single instance of such words. All it takes is *one* word in a 500-page text and your scheme breaks down, and we're back to the bad ole days of codepages. And yes, you can say "well, just include é and ï in the English code page". But then all it takes is a single math formula that requires a Greek letter, and your text is no longer encodable. By the time you pull in all the French, German, and Greek letters and math symbols, you might as well just go back to UTF-8.
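Teoh's point is easy to check directly: a strict single-byte "English" code page (ASCII, in this sketch) fails on the first borrowed word, while UTF-8 absorbs it at the cost of one extra byte per accented letter.

```python
# A minimal illustration: one borrowed word breaks a single-byte
# "English only" code page, while UTF-8 encodes it transparently.
text = "A naïve reading of the cliché"

# ASCII, standing in for a strict single-byte English page,
# cannot encode the text at all.
try:
    text.encode("ascii")
except UnicodeEncodeError as e:
    print("ASCII fails at:", text[e.start])  # 'ï'

# UTF-8 encodes the same text, using two bytes per accented letter.
utf8 = text.encode("utf-8")
print(len(text), len(utf8))  # 29 characters, 31 bytes
```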
I think you misunderstand what this implies. I mentioned it earlier as another possibility to Walter, "keep all your strings in a single language, with a different format to compose them together." Nobody is talking about disallowing alphabets other than English or going back to code pages. The fundamental question is whether it makes sense to combine all these different alphabets and their idiosyncratic rules into a single string and encoding.

There is a good argument to be made that the differences outweigh the similarities and you'd be better off keeping each language/alphabet in its own string. It's a question of modeling, just like a class hierarchy. As I said, I'm not sure this is the best route, but it has some real strengths.
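As a rough illustration of the composition idea Joakim gestures at — one language per string, with documents built by composing tagged fragments — here is a sketch. The `Fragment` type and its field names are hypothetical, not anything actually specified in the thread.

```python
# Hypothetical sketch of "one language per string, composed externally":
# each fragment carries its own language/code page tag, and a document
# is a sequence of fragments rather than one mixed-encoding string.
from dataclasses import dataclass

@dataclass
class Fragment:
    lang: str    # e.g. "en", "el" -- names the single code page used
    text: str

doc = [Fragment("en", "the variance "),
       Fragment("el", "σ²"),
       Fragment("en", " is small")]

# Rendering flattens the fragments; per-language rules (sorting,
# casing, width) could be applied per fragment before joining.
print("".join(f.text for f in doc))  # the variance σ² is small
```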

The alternative is to have embedded escape sequences for the rare foreign letter/word that you might need, but then you're back to being unable to slice the string at will, since slicing it at the wrong place will produce gibberish.
No one has presented this as a viable option.
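For what it's worth, the gibberish Teoh mentions is easy to demonstrate with UTF-8 itself: slicing a byte string in the middle of a multi-byte sequence leaves bytes that no longer decode.

```python
# Slicing a UTF-8 byte string in the middle of a multi-byte sequence
# yields bytes that do not decode back to valid text.
data = "café".encode("utf-8")   # b'caf\xc3\xa9' -- é is two bytes
bad = data[:4]                  # cuts the é sequence in half
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    print("invalid slice: truncated multi-byte sequence")
```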

I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are things about it that are annoying, but it's certainly better than the scheme you're proposing.
I disagree.

On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
And just how exactly does that help with slicing? If anything, it makes slicing way hairier and more error-prone than UTF-8. In fact, this one point alone already defeats any performance gains you may have had with a single-byte encoding. Now you can't do *any* slicing at all without convoluted algorithms to determine what encoding is where at the endpoints of your slice, and the resulting slice must have new headers to indicate the start/end of every different-language substring. By the time you're done with all that, you're going way slower than processing UTF-8.
There are no convoluted algorithms, it's a simple check whether the string contains any two-byte encodings, a check which can be done once and cached. If it's single-byte all the way through, there are no problems whatsoever with slicing. If there are two-byte languages included, the slice function will have to do a little arithmetic before slicing. You will also need a few arithmetic ops to create the new header for the slice. The point is that these operations will be much faster than decoding every code point to slice UTF-8.
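A rough sketch of the cached check Joakim describes. The `TaggedString` layout, and the simplifying assumption that a two-byte string is uniformly two bytes per character, are mine for illustration — neither poster specified the scheme in this detail.

```python
# Hypothetical sketch: a string carries a cached flag saying whether any
# two-byte runs are present; if not, slicing is plain byte arithmetic.
class TaggedString:
    def __init__(self, data: bytes, has_two_byte: bool):
        self.data = data
        # Computed once at construction and cached, as the post suggests.
        self.has_two_byte = has_two_byte

    def slice(self, start: int, end: int) -> bytes:
        if not self.has_two_byte:
            # Single-byte all the way through: O(1) slicing.
            return self.data[start:end]
        # Two-byte language present: map character indices to byte
        # offsets (simplified: whole string is one two-byte language).
        return self.data[start * 2:end * 2]

s = TaggedString(b"hello world", has_two_byte=False)
print(s.slice(0, 5))  # b'hello'
```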

Again I say, I'm not 100% sold on UTF-8, but what you're proposing here is far worse.
Well, I'm glad you realize there are some problems with UTF-8 :), even if you dismiss my alternative out of hand.
