Re: Today's programming challenge - How's your Range-Fu ?

Shachar Shemesh via Digitalmars-d Mon, 20 Apr 2015 00:01:43 -0700

On 19/04/15 22:58, ketmar wrote:

On Sun, 19 Apr 2015 07:54:36 +0000, John Colvin wrote:


it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf


This is not a very accurate depiction of Unicode.

For example:

And, moreover, BOM is meaningless without mentioning of encoding. So we have to specify encoding anyway.

No. The BOM is what lets you auto-detect the encoding. If you know the text will be UTF-8, UTF-16 or UTF-32, but not which one, the BOM will tell you which it is. That is its entire purpose, in fact.


There, pretty much, goes point #1.
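
To make the auto-detection concrete, here is a minimal Python sketch of BOM sniffing. The names `sniff_bom` and `decode_with_bom` are made up for illustration, and falling back to UTF-8 when no BOM is present is an assumption of this sketch, not something the BOM itself tells you.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so testing UTF-16 first would misdetect UTF-32.
# (The flip side: UTF-16 LE text whose first character is U+0000 would be
# taken for UTF-32 LE. This sketch accepts that ambiguity.)
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM if there is one, else the assumed default."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

The caller never names the encoding: the first two to four bytes pick it out of the UTF family, which is all the BOM was ever meant to do.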

And then:

Unicode contains at least “writing direction” control symbols (LTR is U+200E and RTL is U+200F) which role is IDENTICAL to the role of codepage-switching symbols with the associated disadvantages.

That's just ignorance of how the UBA (the Unicode Bidirectional Algorithm, TR#9) works. LRM and RLM are mere invisible characters with defined directionality. Cutting them away from a substring would not invalidate your text any more than cutting away actual text would under the same conditions. In any case, unlike codepage-switching symbols, losing them would only affect your display, not your understanding of the text.


So point #2 is out.
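
As a rough illustration, Python's standard `unicodedata` module shows what these two code points are: format characters that carry a bidirectional class and nothing else. The `describe` helper and the sample string are made up for this sketch.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def describe(ch: str):
    """Return (name, general category, bidi class) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


# Both are category Cf (format) and behave, for the bidi algorithm,
# like an invisible strong left-to-right or right-to-left letter.
lrm_info = describe(LRM)
rlm_info = describe(RLM)

# Dropping the marks changes no letter of the text; only the visual
# ordering chosen by a bidi-aware renderer may differ.
sample = "abc" + RLM + "\u05e9\u05dc\u05d5\u05dd"
stripped = sample.replace(LRM, "").replace(RLM, "")
```

No state is switched on or off here: the bytes after the mark decode to the same characters whether or not the mark survives a cut, which is exactly what is not true of a codepage-switching symbol.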

He has some valid arguments under point #3, but also lots of !(@#&$ nonsense. He is right, I think, that denoting units with separate code points makes no sense, but the rest of his arguments seem completely off. For example, asking Latin and Cyrillic to share the same region merely because some letters look alike makes no sense, implementation-wise.

Points #4, #5, #6 and #7 are the same point. The main objection I have there is his assumption that the situation is, somehow, worse than it was. Yes, if you knew your encoding was Windows-1255, you could assume the text is Hebrew.


Or Yiddish.

And this, I think, is one of the encodings with the fewest languages riding on it. Windows-1256 has Arabic, Persian, Urdu and others. Windows-1252 covers the whole of Western Europe. As pointed out elsewhere in this thread, Spanish and French treat case folding of accented letters differently.

Also, we see that the solution he thinks would work better actually doesn't. People living in France don't switch to a QWERTY keyboard when they want to type English. They type English with their AZERTY keyboard. There simply is no automatic way to tell what language something is typed in without a human telling you (or without applying content-based heuristics).

Microsoft Word stores, for each letter, the keyboard language it was typed with. This causes great problems when copying to other editors, performing searches, or simply trying to get bidirectional text to appear correctly. The problem is so bad that phone numbers where the prefix appears after the actual number are not considered bad form or unusual, even in official PR material or when sending resumes.

In fact, the only time you can count on someone to switch keyboards is when they need to switch to a language with a different alphabet. No Russian speaker will type English using the Russian layout, even if what she has to say happens to use letters with the same glyphs. You simply do not plan that much ahead.

The point I'm driving at is that just because someone posted some rant on the Internet doesn't mean it's correct. When someone says something is broken, always ask them what they suggest instead.


Shachar
