On Tuesday 08 January 2002 08:58 pm, you wrote:
> Followup to:  <[EMAIL PROTECTED]>
> By author:    [EMAIL PROTECTED]
> In newsgroup: linux.utf8
> >
> > > Character Set Encoding of Tags:
> > > ===============================
> > >
> > > UTF-8 is the default encoding for tag data. Unfortunately
> > > UTF-8 muffed it for Asian languages by doing the equivalent of
> > > giving the same character codes to English, Russian, and Greek
> > > letters.
I have followed this debate for several years. I think the Japanese
experts in the Ideographic Rapporteur Group did a wonderful job on Han
unification. Even the Japanese anti-unification stalwarts admit that
the characters they find troublesome would not have been considered
difficult by anyone educated before 1950; that there are only a small
number of characters of concern, primarily U+76F4 (Mathews 1004,
Nelson 775) and characters containing it; and that the problem only
arises when multilingual plain text is displayed in an inappropriate
font. Since the author can control the font in formatted text, and the
user can control the font when viewing plain text, I don't see the
problem. (To which opponents of unification reply, "_That_'s the
problem", so we don't get anywhere.)

On the other hand, I have never heard anyone other than a
mathematician complain about the unification of Fraktur and the other
writing styles of the Latin alphabet, even though Fraktur is extremely
difficult for Americans, and even for younger Germans, to read.
Fraktur will evidently be disunified in Unicode 4.0 for use in
variable names in math, but _not_ for text.

> > It's interesting that Japanese and Chinese, which are unrelated
> > languages, are sometimes mutually understandable when written,
> > but somehow use totally different scripts.

Almost all Japanese characters came from China, although there are a
few indigenous creations, such as the character for mountain pass
(touge), and a number of simplified characters that appeared first in
Japan. Kana is, of course, a purely Japanese invention.

> Also, there would hardly have been any hurt feelings if U+0065
> (Latin), U+0391 (Greek), and U+0410 (Cyrillic) had been unified.
> It would just not have saved enough code points to bother.

Actually, it was impossible because of the source separation rule
applied to Chinese, Korean, and Japanese encodings, including Big
Five, GB2312, KSC, and JIS. (How ironic.)
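To make the round-trip argument concrete, here is a quick check (a
sketch, assuming Python's built-in "gb2312" codec) that GB2312 already
assigns three distinct code positions to Latin, Greek, and Cyrillic
capital A, which is exactly what the source separation rule had to
preserve:

```python
# Sketch (assumes Python's standard "gb2312" codec): GB2312, one of
# the source standards, already encodes Latin A, Greek Alpha, and
# Cyrillic A at distinct code positions.  Round-trip conversion
# through Unicode therefore needs three distinct code points -- the
# source separation rule in action.
for ch, name in [("\u0041", "LATIN CAPITAL LETTER A"),
                 ("\u0391", "GREEK CAPITAL LETTER ALPHA"),
                 ("\u0410", "CYRILLIC CAPITAL LETTER A")]:
    encoded = ch.encode("gb2312")
    assert encoded.decode("gb2312") == ch  # round trip survives
    print(f"U+{ord(ch):04X} {name} -> {encoded.hex()}")
```

The three byte sequences printed are all different; had the three
letters been unified into one Unicode code point, decoding GB2312 text
to Unicode and re-encoding it could not have restored the original
bytes.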
These standards include various combinations of the Latin, Greek, and
Cyrillic alphabets in separate code blocks alongside Hanzi, Zhuyin,
Hangul, and kana. So LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER
A, and GREEK CAPITAL LETTER ALPHA cannot be unified without breaking
round-trip conversion for these standards. On the other hand, Kurdish
Cyrillic Q has been unified with Latin Q, since there was no
pre-existing character set standard containing both at separate code
points. There is, for example, no KOI-8-K(urdish).

> As far as "English" and "Russian" are concerned, the various
> Latin-script and the various Cyrillic-script languages have been
> unified for a long, long time.

Do you mean each group separately from the other? Otherwise I can't
make sense of this.

> Things like U+212A and U+212B should never have been allowed to
> happen, on the other hand, IMNSHO.

(KELVIN SIGN and ANGSTROM SIGN.) The source separation rule, again.
We're stuck with them in the standard, but we don't have to use them,
or any of the other Compatibility Characters and Presentation Forms.
Anyway, the true worst case was the encoding of more than 11,000
Hangul syllables, none of which is required.

> -hpa

Better an imperfect but entirely usable standard than no standard.

-- 
Edward Cherlin
[EMAIL PROTECTED]
Does your Web site work?

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
