On Monday 2003.08.11 15:04:07 +0200, Keld Jørn Simonsen wrote:
> On Mon, Aug 11, 2003 at 10:31:16AM +0100, Markus Kuhn wrote:
>
> > ISO 10646 lacks much of the useful information, guidelines, databases,
> > technical reports, and subsetting information that the Unicode Standard
> > provides. ISO 10646 mentions briefly three implementation levels, which
> > look not too useful in practice and appear a bit like they have been put
> > in on short notice to shut up someone in the committee who wasn't happy
> > with combining characters.
>
> yes, Unicode have more information than ISO. Unicode has chosen not to
> forward these specs for ISO standardisation to gain control over the
> specification, AFAICT. It is like the old "embrace and enhance" policy
> that many big companies have so big success in doing.
>
> One of the reasons for the non-submissions of Unicode specs to ISO may
> be that many of them are kludges, like having more than one
> representation for a given character (like the fully composed and the
> combining characters, and then the myriads of normalization forms then
> required to make sense of it) and the 16 bit hack of UTF-16.

I'm sure that non-submission of Unicode specs to ISO has nothing to do
with the fact that you think some Unicode specs are kludges.

While I agree with you that having a myriad of fully composed characters
in Unicode is not an aesthetically pleasing design, the reality is that
Unicode/ISO-10646 is a system invented by humans --and all human systems
are, necessarily, imperfect! Like every other system, Unicode is an
imperfect set of compromises. At least in the case of Unicode, the
compromises have been carefully considered and reconsidered by groups of
people who --I believe-- are truly interested in producing the best
system possible (given the existing precondition of pervasive
imperfection in the world ...).

Inclusion of precomposed characters is a compromise aimed at achieving
compatibility with the myriad legacy encodings which are, arguably, even
more imperfect than Unicode --but which are in daily use all over the
world.

So what if Unicode is imperfect? I can live with that. (Nobody is forcing
me to use precomposed forms if I don't like them: Unicode also provides
the combining characters.) Those normalization forms are a bit scary --I
don't think I can keep any of them straight in my mind! (My policy is
generally to avoid anything I can't keep straight!)

=======================================
INTERNATIONAL CONSOLE DESIGN PRIORITIES
=======================================

(1) RENDERING: The first priority of an internationalized console should
be --as Keld has already mentioned-- correct rendering on the display.
As Keld also mentioned, this is easy for precomposed characters and
harder for combining sequences --but of course it has already been
achieved in programs like mlterm, etc.

(2) PLUGGABLE INPUT METHODS: The second priority is making sure console
input methods are handled correctly. There should be a simple way for
the user to switch between a multitude of pluggable keyboard input
methods on the fly.

What if we had a virtual "Keyboard" class from which we could derive two
sub-classes: a "DecomposedKeyboard" class, and a "PrecomposedKeyboard"
class?

The "DecomposedKeyboard" class always emits decompositions. For example,
in a subclass called "SpanishDecomposedKeyboard", the "ñ" in the Spanish
words "año" and "montaña" is emitted as U+006E LATIN SMALL LETTER N +
U+0303 COMBINING TILDE. In the alternate "SpanishPrecomposedKeyboard"
class, the same "ñ" is emitted as U+00F1 LATIN SMALL LETTER N WITH TILDE
instead. The user chooses a keyboard based on his or her needs,
forward-marching or legacy-bound, as the case may be.

These base classes would support non-European languages just the same.
For example, an ArabicDecomposedKeyboard would emit U+0628 ARABIC LETTER
BEH + U+0646 ARABIC LETTER NOON for the "ﱭ" U+FC6D ARABIC LIGATURE BEH
WITH NOON FINAL FORM, while the ArabicPrecomposedKeyboard would simply
emit U+FC6D. If the user knows he has to interface with or send data to
some legacy system, then he knows which one to choose. Otherwise, he
doesn't care and goes with the default "Decomposed" method for his
language.

CJK input methods could be derived from the same base classes too: I'll
elaborate if anyone thinks this needs elaboration.

CONCLUSION: By supporting forward-looking "Decomposed" and more
legacy-compatible "Precomposed" classes, the console system engineers
are spared the arduous task of trying to determine which is "right" for
the user. If the console engineers provide a very flexible and open API,
then developers and users can do whatever they want with it! It will be
interesting to watch the genetic development over time ... This is the
essence of what Linux is all about!

How's that for my 2 cents worth?

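To make the class idea above a little more concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions,
not a proposed console API: the emit() method and the code_points()
helper are made-up names, and the two base classes simply lean on the
standard NFD and NFC normalization forms from Python's unicodedata
module rather than on hand-written per-language tables.

```python
import unicodedata
from abc import ABC, abstractmethod


class Keyboard(ABC):
    """Hypothetical base class: turns typed text into emitted code points."""

    @abstractmethod
    def emit(self, text: str) -> str:
        """Return the code point sequence emitted for the typed text."""


class DecomposedKeyboard(Keyboard):
    def emit(self, text: str) -> str:
        # NFD = canonical decomposition: base letter, then combining marks.
        return unicodedata.normalize("NFD", text)


class PrecomposedKeyboard(Keyboard):
    def emit(self, text: str) -> str:
        # NFC = canonical composition: precomposed characters where they exist.
        return unicodedata.normalize("NFC", text)


class SpanishDecomposedKeyboard(DecomposedKeyboard):
    """Forward-looking variant (class name taken from the proposal)."""


class SpanishPrecomposedKeyboard(PrecomposedKeyboard):
    """Legacy-compatible variant (class name taken from the proposal)."""


def code_points(text: str) -> str:
    """Illustrative helper: format a string as 'U+XXXX U+XXXX ...'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    typed = "\u00f1"  # the "ñ" key, however the driver happens to deliver it
    for kb in (SpanishDecomposedKeyboard(), SpanishPrecomposedKeyboard()):
        print(type(kb).__name__, code_points(kb.emit(typed)))
```

One caveat on the design choice: this shortcut only covers canonical
equivalents such as "ñ". As far as I can tell, the Arabic ligature
U+FC6D has only a compatibility decomposition, so NFD and NFC leave it
alone, and an Arabic keyboard pair would need an explicit mapping table
(or the compatibility forms) instead.
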
- Ed Trager
  Kellogg Eye Center, University of Michigan
  Ann Arbor, Michigan, USA

>
> best regards
> keld
> --
> Linux-UTF8: i18n of Linux on all levels
> Archive: http://mail.nl.linux.org/linux-utf8/
