For some reason this posting by H. S. Teoh shows up on the mailing list but not on the forum.

On Sat May 25 13:42:10 PDT 2013, H. S. Teoh wrote:
> On Sat, May 25, 2013 at 10:07:41AM +0200, Joakim wrote:
>> The vast majority of non-English alphabets in UCS can be encoded in a single byte. It is your exceptions that are not relevant.
> I'll have you know that Chinese, Korean, and Japanese account for a significant percentage of the world's population, and therefore arguments about "vast majority" are kinda missing the forest for the trees. If you count the number of *alphabets* that can be encoded in a single byte, you can get a majority, but that in no way reflects actual usage.

Not just "a majority": the vast majority of alphabets, representing 85% of the world's population.
>>> The only alternatives to a variable width encoding I can see are:
>>> - Single code page per string
>>> This is completely useless because now you can't concatenate strings of different code pages.
>> I wouldn't be so fast to ditch this. There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings. Is this the best route? I'm not sure, but I certainly wouldn't dismiss it out of hand.
> This is so patently absurd I don't even know how to begin to answer... have you actually dealt with any significant amount of text at all? A large amount of text in today's digital world is at least bilingual, if not more. Even in pure English text, you occasionally need a foreign letter in order to transcribe a borrowed/quoted word, e.g., "cliché", "naïve", etc. Under your scheme, it would be impossible to encode any text that contains even a single instance of such words. All it takes is *one* word in a 500-page text and your scheme breaks down, and we're back to the bad ole days of codepages. And yes you can say "well just include é and ï in the English code page". But then all it takes is a single math formula that requires a Greek letter, and your text is non-encodable anymore. By the time you pull in all the French, German, Greek letters and math symbols, you might as well just go back to UTF-8.
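[For what it's worth, Teoh's point here is easy to demonstrate; a minimal Python sketch (mine, not from the thread) showing that a lone accented letter costs only one extra byte in UTF-8 rather than forcing a codepage switch:]

```python
# In UTF-8, ASCII characters take one byte each; an accented Latin letter
# like é (U+00E9) or ï (U+00EF) takes two. A borrowed word in otherwise
# pure English text therefore costs one extra byte per accent, nothing more.
for word in ["cliche", "cliché", "naïve"]:
    encoded = word.encode("utf-8")
    print(f"{word}: {len(word)} chars, {len(encoded)} bytes")
```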
I think you misunderstand what this implies. I mentioned it earlier as another possibility to Walter: "keep all your strings in a single language, with a different format to compose them together." Nobody is talking about disallowing alphabets other than English or going back to code pages. The fundamental question is whether it makes sense to combine all these different alphabets and their idiosyncratic rules into a single string and encoding.

There is a good argument to be made that the differences outweigh the similarities and you'd be better off keeping each language/alphabet in its own string. It's a question of modeling, just like a class hierarchy. As I said, I'm not sure this is the best route, but it has some real strengths.
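[A hypothetical sketch of what "single-language strings with a different format to compose them" might look like; every name below is my invention for illustration, not a design anyone in the thread actually specified:]

```python
from dataclasses import dataclass

# Hypothetical model: a "run" is text entirely in one language/alphabet,
# and a document is a sequence of runs instead of one mixed-alphabet string.
@dataclass
class Run:
    lang: str   # e.g. "en" for English, "el" for Greek
    text: str

doc = [
    Run("en", "The area of a circle is "),
    Run("el", "π"),
    Run("en", "r squared."),
]

# Composition concatenates the runs for display, while per-language
# processing (collation, casing, line breaking) can handle each run
# under its own alphabet's rules.
rendered = "".join(run.text for run in doc)
print(rendered)
```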
> The alternative is to have embedded escape sequences for the rare foreign letter/word that you might need, but then you're back to being unable to slice the string at will, since slicing it at the wrong place will produce gibberish.

No one has presented this as a viable option.

> I'm not saying UTF-8 (or UTF-16, etc.) is a panacea -- there are things about it that are annoying, but it's certainly better than the scheme you're proposing.

I disagree.
On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
> And just how exactly does that help with slicing? If anything, it makes slicing way hairier and error-prone than UTF-8. In fact, this one point alone already defeated any performance gains you may have had with a single-byte encoding. Now you can't do *any* slicing at all without convoluted algorithms to determine what encoding is where at the endpoints of your slice, and the resulting slice must have new headers to indicate the start/end of every different-language substring. By the time you're done with all that, you're going way slower than processing UTF-8.
There are no convoluted algorithms, it's a simple check of whether the string contains any two-byte encodings, a check which can be done once and cached. If it's single-byte all the way through, no problems whatsoever with slicing. If there are two-byte languages included, the slice function will have to do a little arithmetic calculation before slicing. You will also need a few arithmetic ops to create the new header for the slice. The point is that these operations will be much faster than decoding every code point to slice UTF-8.
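[My reading of that scheme, sketched in Python with invented details; the header layout and all names are illustrative assumptions, not a concrete proposal from the thread:]

```python
from dataclasses import dataclass

@dataclass
class Segment:
    lang: str   # language tag for this run of text
    chars: int  # number of characters in the segment
    width: int  # bytes per character: 1 or 2, fixed within a segment

@dataclass
class TaggedString:
    header: list  # list of Segment: the per-language "header"
    data: bytes   # payload: fixed-width characters, segment by segment

    @property
    def single_byte_only(self):
        # The check that can be done once and cached per string.
        return all(seg.width == 1 for seg in self.header)

    def char_to_byte(self, index):
        # A little header arithmetic instead of per-code-point decoding.
        byte = 0
        for seg in self.header:
            if index <= seg.chars:
                return byte + index * seg.width
            index -= seg.chars
            byte += seg.chars * seg.width
        return byte

    def slice_bytes(self, start, stop):
        # start/stop are character indices.
        if self.single_byte_only:
            return self.data[start:stop]  # chars == bytes: plain slice
        return self.data[self.char_to_byte(start):self.char_to_byte(stop)]
```

Building the new header for the slice would be a similar pass over the segments; I've omitted it to keep the sketch short.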
> Again I say, I'm not 100% sold on UTF-8, but what you're proposing here is far worse.

Well, I'm glad you realize some problems with UTF-8 :) even if you dismiss my alternative out of hand.