Re: Linux console internationalization

Edward H. Trager Thu, 14 Aug 2003 18:12:43 -0700

On Monday 2003.08.11 15:04:07 +0200, Keld Jørn Simonsen wrote:
> On Mon, Aug 11, 2003 at 10:31:16AM +0100, Markus Kuhn wrote:
> 
> > ISO 10646 lacks much of the useful information, guidelines, databases,
> > technical reports, and subsetting information that the Unicode Standard
> > provides. ISO 10646 mentions briefly three implementation levels, which
> > look not too useful in practice and appear a bit like they have been put
> > in on short notice to shut up someone in the committee who wasn't happy
> > with combining characters.
> 
> yes, Unicode have more information than ISO. Unicode has chosen not to
> forward these specs for ISO standardisation to gain control over the
> specification, AFAICT.  It is like the old "embrace and enhance" policy
> that many big companies have so big success in doing. 
> 
> One of the reasons for the non-submissions of Unicode specs to ISO may
> be that many of them are kludges, like having more than one
> representation for a given character (like the fully composed and the
> combining characters, and then the myriads of normalization forms then
> required to make sense of it) and the 16 bit hack of UTF-16.


I'm sure that non-submission of Unicode specs to ISO has nothing to
do with the fact that you think some Unicode specs are kludges.

While I agree with you that it is not an aesthetically pleasing design
to have the myriad of fully composed characters in Unicode,
the reality is that Unicode/ISO-10646 is a system 
invented by humans --and all human systems are, necessarily, imperfect!
  
Unicode shares with all other systems the fact that it is an imperfect 
set of compromises.  At least in the case of Unicode, the set of compromises 
has been carefully considered and reconsidered by groups of people who --I
believe-- are truly interested in producing the best system possible
(in light of the existing precondition of pervasive imperfection in the 
world ...).

Inclusion of precomposed characters is a compromise aimed at achieving
compatibility with the myriad legacy encodings which are, arguably,
even more imperfect than Unicode --but which are in daily use all
over the world.  So what if Unicode is imperfect?  I can live with that.
(Nobody is forcing me to use precomposed forms if I don't like them:
Unicode also provides the combining characters.)  Those normalization forms
are a bit scary --I don't think I can keep any of them straight
in my mind!  (My policy is generally to avoid anything I can't keep straight!)
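
For anyone else who has trouble keeping the forms straight, here is a
minimal sketch of the two canonical ones, NFC (composed) and NFD
(decomposed).  It assumes Python and its standard unicodedata module;
the variable names are mine, purely for illustration.

```python
import unicodedata

precomposed = "\u00F1"   # LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # LATIN SMALL LETTER N + COMBINING TILDE

# The two spellings look the same on screen but are different strings.
same_code_points = (precomposed == decomposed)

# NFC composes where it can; NFD decomposes where it can.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# After normalizing both sides to the same form, comparison works again.
equal_after_nfc = (unicodedata.normalize("NFC", precomposed) == nfc)
```

The practical point is that a program comparing or searching text has to
pick one form and normalize to it first; which form it picks matters less
than picking one.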

=======================================
INTERNATIONAL CONSOLE DESIGN PRIORITIES
=======================================

(1) RENDERING:

The first priority of an internationalized console should be --as
Keld has already mentioned-- correct rendering on the display.
As Keld also mentioned, this is easy for precomposed characters and
harder for decomposed sequences, but of course it has already been
achieved in programs like mlterm.
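
One small piece of why decomposed text is harder to render: a combining
mark must not advance the cursor.  The sketch below is only an
illustration of that one point, assuming Python's unicodedata module.
The function name cell_count is hypothetical, and the sketch deliberately
ignores double-width East Asian characters and everything else a real
console has to handle.

```python
import unicodedata

def cell_count(text):
    """Count display cells, assuming one cell per non-combining character.

    Characters with a non-zero canonical combining class are treated as
    occupying no cell of their own (they are drawn over the previous one).
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

precomposed_word = "a\u00F1o"    # "año" with a precomposed n-tilde
decomposed_word = "an\u0303o"    # "año" with n + COMBINING TILDE

precomposed_cells = cell_count(precomposed_word)
decomposed_cells = cell_count(decomposed_word)
```

Both spellings should take up the same three cells, even though the
decomposed string is one code point longer.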

(2) PLUGGABLE INPUT METHODS:

The second priority is making sure console input methods are
handled correctly.  There should be an easy and simple
way for the user to switch between a multitude of pluggable keyboard input 
methods on the fly.  

What if we had a virtual "Keyboard" class from which we could derive two 
sub-classes: a "DecomposedKeyboard" class, and a 
"PrecomposedKeyboard" class?  

The "DecomposedKeyboard" class always emits decompositions:
For example, in a subclass called "SpanishDecomposedKeyboard",
the "ñ" in the Spanish words "año" and "montaña" is emitted as
u006E LATIN SMALL LETTER N + u0303 COMBINING TILDE.  But in an alternate
"SpanishPrecomposedKeyboard" class, the same "ñ" is emitted as u00F1 LATIN
SMALL LETTER N WITH TILDE instead.  The user chooses which keyboard he or she
wants based on his or her needs, forward-marching or legacy-bound, as the
case may be.
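
To make the idea concrete, here is a rough sketch of that hierarchy in
Python.  The class names are the ones proposed above; the emit() method
and the use of NFD/NFC normalization to produce the output are my own
assumptions for illustration.  A real keyboard class would more likely
carry its own keymap tables.

```python
import unicodedata
from abc import ABC, abstractmethod

class Keyboard(ABC):
    """Virtual base class: turns a typed character into emitted code points."""

    @abstractmethod
    def emit(self, keystroke):
        ...

class DecomposedKeyboard(Keyboard):
    def emit(self, keystroke):
        # Assumption: canonical decomposition (NFD) stands in for a keymap.
        return unicodedata.normalize("NFD", keystroke)

class PrecomposedKeyboard(Keyboard):
    def emit(self, keystroke):
        # Assumption: canonical composition (NFC) stands in for a keymap.
        return unicodedata.normalize("NFC", keystroke)

class SpanishDecomposedKeyboard(DecomposedKeyboard):
    pass

class SpanishPrecomposedKeyboard(PrecomposedKeyboard):
    pass

# The same key press, emitted two ways.
forward = SpanishDecomposedKeyboard().emit("\u00F1")    # n + COMBINING TILDE
legacy = SpanishPrecomposedKeyboard().emit("n\u0303")   # single code point
```

The application reading from the console only ever sees a Keyboard, so
it does not need to know which of the two families is plugged in.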

These base classes would support non-European languages just the same.
For example, an ArabicDecomposedKeyboard would emit u0628 ARABIC LETTER BEH
+ u0646 ARABIC LETTER NOON for the "ﱭ" uFC6D ARABIC LIGATURE BEH WITH NOON
FINAL FORM, while the ArabicPrecomposedKeyboard would simply emit uFC6D.
If the user knows he has to interface with or send data to some legacy
system, then he knows which one to choose.  Otherwise, he doesn't care and
goes with the default "Decomposed" method for his language.
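
One wrinkle worth noting: the Arabic ligature is a compatibility
character, not a canonical composition, so it behaves differently from
the Spanish case.  The sketch below (Python, standard unicodedata module,
constant names mine) shows that only the compatibility form NFKD splits
it into BEH + NOON, while the canonical form NFD leaves it alone.

```python
import unicodedata

LIGATURE = "\uFC6D"         # ARABIC LIGATURE BEH WITH NOON FINAL FORM
BEH_NOON = "\u0628\u0646"   # ARABIC LETTER BEH + ARABIC LETTER NOON

# Canonical decomposition does not touch presentation-form ligatures.
canonical = unicodedata.normalize("NFD", LIGATURE)

# Compatibility decomposition maps the ligature to its base letters.
compatibility = unicodedata.normalize("NFKD", LIGATURE)

# No normalization form recomposes BEH + NOON into the ligature.
recomposed = unicodedata.normalize("NFC", BEH_NOON)
```

So an ArabicPrecomposedKeyboard could not lean on normalization alone; I
would expect it to need an explicit table of ligatures to emit.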

CJK input methods could be derived from the same base classes too: I'll
elaborate if anyone thinks this needs elaboration.

CONCLUSION: By supporting forward-looking "Decomposed" and more
legacy-compatible "Precomposed" classes, the console system engineers are
spared the arduous task of trying to determine which is "right" for the
user.  If the console engineers provide a very flexible and open API, then
developers and users can do whatever they want with it!  It will be
interesting to watch the genetic development over time ... This is the
essence of what Linux is all about!
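
As a sketch of what "flexible and open" could mean in practice, here is
a tiny registry that lets input methods be plugged in and switched on
the fly.  Everything in it (the Console class, register, switch,
type_key, and the method names "es-decomposed" and "es-precomposed") is
hypothetical and only meant to illustrate the shape of such an API.

```python
import unicodedata

class Console:
    """Holds named input methods and routes key presses to the active one."""

    def __init__(self):
        self._methods = {}
        self._active = None

    def register(self, name, method):
        # 'method' is any callable taking a typed string, returning a string.
        self._methods[name] = method
        if self._active is None:
            self._active = name   # the first registered method is the default

    def switch(self, name):
        if name not in self._methods:
            raise KeyError("unknown input method: " + name)
        self._active = name

    def type_key(self, keystroke):
        if self._active is None:
            raise RuntimeError("no input method registered")
        return self._methods[self._active](keystroke)

console = Console()
console.register("es-decomposed", lambda s: unicodedata.normalize("NFD", s))
console.register("es-precomposed", lambda s: unicodedata.normalize("NFC", s))

default_output = console.type_key("\u00F1")   # decomposed is the default
console.switch("es-precomposed")
legacy_output = console.type_key("n\u0303")   # now precomposed
```

Because the registry accepts any callable, third parties could add CJK
or other input methods without the console itself having to change.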

How's that for my 2 cents worth? 

- Ed Trager
  Kellogg Eye Center, University of Michigan
  Ann Arbor, Michigan, USA

> 
> best regards
> keld
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
> 