On 19/04/15 22:58, ketmar wrote:
On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:

it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf


This is not a very accurate depiction of Unicode.

For example:
And, moreover, BOM is meaningless without mentioning of encoding. So we have to specify encoding anyway.

No. The BOM is what lets you auto-detect the encoding. If you know the text is UTF-8, UTF-16 or UTF-32 but not which one (or which byte order), the BOM will tell you. That is its entire purpose, in fact.
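To make that concrete, here is a minimal sketch of BOM sniffing in Python. The names `sniff_bom` and `decode_with_bom` are mine, made up for illustration, and the UTF-8 fallback for BOM-less input is an assumption rather than anything the standard mandates.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of this table matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM if present, else the (assumed) fallback."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

One known ambiguity: UTF-16 LE text whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32 LE BOM, and this sketch will guess UTF-32 LE for it.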

There, pretty much, goes point #1.

And then:
Unicode contains at least “writing direction” control symbols (LTR is U+200E and RTL is U+200F) which role is IDENTICAL to the role of codepage-switching symbols with the associated disadvantages.

That's just ignorance of how the UBA (TR#9) works. LRM and RLM are mere invisible characters with defined directionality. Cutting them away from a substring would not invalidate your text more than cutting away actual text would under the same conditions. In any case, unlike page switching symbols, it would only affect your display, not your understanding of the text.
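A quick way to see this, sketched in Python with the standard `unicodedata` module (the helper names `describe` and `strip_marks` are hypothetical, for illustration only): LRM and RLM are format characters with a bidi class of their own, and removing them changes no other code point in the string.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def describe(ch: str):
    """Return (general category, bidi class) for one character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


def strip_marks(text: str) -> str:
    """Drop LRM/RLM; every other code point is left untouched."""
    return text.replace(LRM, "").replace(RLM, "")


# Latin text, an RLM, three Hebrew letters, an LRM, then punctuation.
sample = "abc" + RLM + " \u05d0\u05d1\u05d2" + LRM + "!"
stripped = strip_marks(sample)

print(describe(LRM), describe(RLM))
print(stripped == "abc \u05d0\u05d1\u05d2!")
```

Compare that with a codepage-switching byte: lose it, and every byte after it decodes to a different character. Here the letters stay the same letters, and only the layout the UBA computes for display can change.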

So point #2 is out.

He has some valid argument under point #3, but also lots of !(@#&$ nonsense. He is right, I think, that denoting units with separate code points makes no sense, but the rest of his arguments seem completely off. For example, asking Latin and Cyrillic to share the same region merely because some letters look alike makes no sense, implementation-wise.


Points #4, #5, #6 and #7 are the same point. The main objection I have there is his assumption that the situation is, somehow, worse than it was. Yes, if you knew your encoding was Windows-1255, you could assume the text is Hebrew.

Or Yiddish.

And this, I think, is one of the encodings with the fewest languages riding on it. Windows-1256 has Arabic, Persian, Urdu and others. Windows-1252 covers all of the Western European languages. As pointed out elsewhere in this thread, Spanish and French treat case folding of accented letters differently.
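As a rough illustration of how little a code page actually tells you, here is a small Python sketch using the standard `cp1252`, `cp1255` and `cp1256` codecs (the variable names are mine). The same byte maps to a different script under each code page, and that is all the code page decides; which of the languages written in that script you are looking at is not recorded anywhere in the bytes.

```python
BYTE = b"\xe1"

# Same byte, three code pages, three different scripts.
as_latin = BYTE.decode("cp1252")   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
as_hebrew = BYTE.decode("cp1255")  # U+05D1 HEBREW LETTER BET
as_arabic = BYTE.decode("cp1256")  # U+0644 ARABIC LETTER LAM

for name, ch in [("cp1252", as_latin), ("cp1255", as_hebrew), ("cp1256", as_arabic)]:
    print(name, f"U+{ord(ch):04X}")
```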

Also, we see that the solution he thinks would work better actually doesn't. People living in France don't switch to a QWERTY keyboard when they want to type English. They type English with their AZERTY keyboard. There simply is no automatic way to tell what language something is typed in without a human telling you (or applying content-based heuristics).

Microsoft Word stores, for each letter, the keyboard language it was typed with. This causes great problems when copying to other editors, performing searches, or simply trying to get bidirectional text to appear correctly. The problem is so bad that phone numbers where the prefix appears after the actual number are not considered bad form or unusual, even in official PR material or when sending resumes.

In fact, the only time you can count on someone to switch keyboards is when they need to switch to a language with a different alphabet. No Russian speaker will type English using the Russian layout, even if what she has to say happens to use letters with the same glyphs. You simply do not plan that much ahead.

The point I'm driving at is that just because someone posted some rant on the Internet doesn't mean it's correct. When someone says something is broken, always ask them what they suggest instead.

Shachar
