On 10/17/05, Peter da Silva <pe...@taronga.com> wrote:
In fact ISO-10646 is way too conservative for my tastes. I think each distinct set of case transformation rules should have four 16-bit planes allocated to it, so that truly internationalised characters will be able to reliably toggle case when c&0x10000000 by flipping c&0x20000000. The current set of wishy-washy unified characters with c&0x10000000 == 0 should be left to rot like the hateful legacy things that they are.
This "toggle" you speak of implies that you believe there are only two cases. Thanks to Unicode (and legacy character sets, and existing alphabets which use digraphs which made their way into those legacy character sets, and round-tripping between alphabets which use digraphs and those which don't), we have three, though: upper-case, lower-case, and title-case.

The difference between upper-case and title-case becomes apparent with such characters as "nj" (ASCII surrogate for U+01CC LATIN SMALL LETTER NJ), which becomes "Nj" (U+01CB) in title-case but "NJ" (U+01CA) in upper-case.

I'm not sure whether the hate in this case is for the coded character sets that allowed digraphs in as single characters, or for the fact that, given they exist, software ignores them by blindly uppercasing rather than titlecasing when appropriate.

(That's not even considering the characters where casing depends on context; the most famous examples are probably Greek sigma (which has two lower-case forms, depending [roughly speaking] on whether it's the last letter of the word) and the letter "i" (which has two upper-case forms, one with dot and one without, depending on the language).)

--
Philip Newton <philip.new...@gmail.com>
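For anyone who wants to see the three cases in practice, here is a minimal sketch in Python 3 (not part of the original message). It relies only on the built-in str methods and the standard unicodedata module; the variable names are purely illustrative. It assumes an interpreter recent enough to apply the full Unicode case mappings, including the final-sigma rule in str.lower().

```python
import unicodedata

# U+01CC LATIN SMALL LETTER NJ: a single code point with three case forms.
nj = "\u01cc"

upper = nj.upper()     # upper-case mapping, U+01CA
title = nj.title()     # title-case mapping, U+01CB
lower = upper.lower()  # back to the lower-case form, U+01CC

for label, ch in (("lower", lower), ("title", title), ("upper", upper)):
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# Greek sigma: lower-casing depends on position in the word.
word = "\u039f\u0394\u039f\u03a3"  # capital omicron, delta, omicron, sigma
final_sigma = word.lower()[-1]     # word-final sigma, U+03C2
medial_sigma = "\u03a3".lower()    # isolated sigma lower-cases to U+03C3

# Dotted and dotless i: str.upper() is language-independent, so it
# always yields plain "I" and never U+0130 as Turkish rules would require.
plain_upper_i = "i".upper()
```

A two-state bit flip cannot represent any of this: the digraph needs three states, sigma needs to look at its neighbours, and "i" needs to know the language, which the built-in methods above do not take into account.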