Re: Linux console internationalization

Edward H. Trager Thu, 14 Aug 2003 13:10:34 -0700

On Tuesday 2003.08.12 15:19:53 +0200, Kent Karlsson wrote:
> 
> Edward H. Trager wrote:
> 
> > Inclusion of precomposed characters is a compromise aimed at achieving
> ...
> > (Nobody is forcing me to use precomposed forms if I don't like them:
> > Unicode also provides the combining characters). 
> 
> True. But I have the impression that Keld rather likes precomposed...
> 
> > What if we had a virtual "Keyboard" class from which we could 
> > derive two sub-classes: a "DecomposedKeyboard" class, and a 
> > "PrecomposedKeyboard" class?  
> 
> Classes? I find it somewhat scary to have to define (C++??, Java???)
> classes to make keyboard layouts. At present, Linux (and other
> Unixes) mostly use XKB which has a text format input for keyboard
> layouts, which is unrelated to any "general purpose programming
> language". Likewise, MacOS X has an XML based data file format for
> specifying keyboard layouts. (Both systems "compile" these to
> something more efficient for runtime.)
>


I have a comment about this below ...

> > These base classes would support non-European languages just the same.
> > For example, an ArabicDecomposedKeyboard would emit u0628 BEH 
> > + u0646 NOON
> > for the "ﱭ" uFC6D BEH WITH NOON FINAL FORM LIGATURE, while the
> > ArabicPrecomposedKeyboard would simply emit uFC6D.  If the user knows
> > he has to interface or send data to some legacy system, then he knows
> > which one to choose.  Otherwise, he doesn't care and goes 
> > with the default "Decomposed" method for his language.
> 
> By and large, "decomposed" vs. "precomposed" only makes sense
> for Latin, Greek, Cyrillic, Hangul, and Hiragana/Katakana.
> There are precomposed letters also for other scripts, but mostly they
> are not to be recommended. 

Perhaps not recommended, but are they being used? For example, in the
Thai block of Unicode, " ำ " u0e33 SARA AM is encoded, and it appears
on the TIS-620 keyboard layout.  The TIS-620 keyboard also includes
" ํ " u0e4d THAI NIKHAHIT and " า " u0e32 THAI SARA AA.  The 
Unicode normalization table for Thai,
http://www.unicode.org/charts/normalization/chart_Thai.html,
shows the decomposition of u0e33 SARA AM to u0e4d NIKHAHIT + 
u0e32 SARA AA in the "KC" and "KD" normalization forms.  Thai people
consider SARA AM a separate vowel letter and *everybody* types
" ำ " u0e33 for words that require that vowel.  *Nobody* types
u0e4d NIKHAHIT followed by u0e32 SARA AA.  However, if I type
u0e4d NIKHAHIT followed by u0e32 SARA AA (in Yudit, for example), it
looks just as if I had typed u0e33 SARA AM. And when I save
the file, Yudit does not normalize u0e33 SARA AM to its decomposition
of u0e4d NIKHAHIT followed by u0e32 SARA AA: so, I can easily make
a file that has both the precomposed and the decomposed forms of
THAI SARA AM in it.  With the appropriate console keyboard, I could
do the same thing on the console.
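
To make the SARA AM point concrete, here is a minimal sketch using
Python's standard unicodedata module.  The choice of Python and the
sample syllable (NO NU + SARA AM) are my own assumptions for
illustration; this says nothing about how Yudit works internally.

```python
import unicodedata

SARA_AM = "\u0e33"   # THAI CHARACTER SARA AM (the form people type)
NIKHAHIT = "\u0e4d"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0e32"   # THAI CHARACTER SARA AA
NO_NU = "\u0e19"     # THAI CHARACTER NO NU, used only as a sample consonant

precomposed = NO_NU + SARA_AM
decomposed = NO_NU + NIKHAHIT + SARA_AA

# The two spellings look alike on screen but are different code point sequences.
differ_as_stored = precomposed != decomposed

# SARA AM only has a *compatibility* decomposition, so the canonical
# forms (NFC, NFD) leave it untouched ...
nfc_unchanged = unicodedata.normalize("NFC", precomposed) == precomposed
nfd_unchanged = unicodedata.normalize("NFD", precomposed) == precomposed

# ... while the "K" forms (NFKC, NFKD) map both spellings to the same sequence.
nfkd_equal = (unicodedata.normalize("NFKD", precomposed)
              == unicodedata.normalize("NFKD", decomposed))
nfkc_equal = (unicodedata.normalize("NFKC", precomposed)
              == unicodedata.normalize("NFKC", decomposed))

print(unicodedata.decomposition(SARA_AM))
print(differ_as_stored, nfc_unchanged, nfd_unchanged, nfkd_equal, nfkc_equal)
```

So a file that mixes both spellings only compares equal after KC or
KD normalization, which is why an editor that saves without
normalizing can leave both forms in the same file.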

Note that this behaviour is different from what usually happens for
Arabic (as far as I can tell).
Using Yudit as the example again, one can type Arabic words containing
various forms of the LAM-ALEF ligatures.  On the keyboard, one types
the key for LAM and separately the key for ALEF, and Yudit's shaping
engine figures out which ligature to use.  When the file is saved,
one can see that the file on disk contains only the Unicode values
for LAM and ALEF, and none of the precomposed presentation form values
(which is really a good thing for Arabic).
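
One rough way to check what ended up on disk is to scan the text for
code points in the Arabic Presentation Forms blocks.  The sketch below
is only an illustration: the helper name presentation_forms is
hypothetical, and the ranges are the block ranges as I understand them
(Forms-A is U+FB50..U+FDFF; Forms-B is U+FE70..U+FEFF, stopped here at
U+FEFC so that the byte order mark U+FEFF is not flagged).

```python
import unicodedata

# Assumed ranges for the Arabic Presentation Forms-A and Forms-B blocks.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFC))


def presentation_forms(text):
    """Return the characters in text that are Arabic presentation forms."""
    return [ch for ch in text
            if any(lo <= ord(ch) <= hi for lo, hi in PRESENTATION_RANGES)]


LAM = "\u0644"             # ARABIC LETTER LAM
ALEF = "\u0627"            # ARABIC LETTER ALEF
LAM_ALEF_ISOLATED = "\ufefb"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

typed = LAM + ALEF             # what the keyboard sends and the file stores
legacy = LAM_ALEF_ISOLATED     # what a presentation-form keyboard would send

print(presentation_forms(typed))        # nothing flagged
print(len(presentation_forms(legacy)))  # one character flagged

# Compatibility normalization maps the ligature back to plain LAM + ALEF.
print(unicodedata.normalize("NFKC", legacy) == typed)
```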

> In particular the ones you find in U+Fxxx
> should not be used (unless you REALLY have to). The one you referred
> to is in addition a contextual form (a FINAL form) and should only be
> used at the end of a word (as far as it has been written). 

Yes, that was a bad example.  LAM-ALEF would probably have been a
better one.

> Normally, these things are dynamically handled via contextual shaping rules,
> and the presentation form *characters* are NOT used. A data file on
> Arabic shaping rules is maintained by Unicode (but NOT by ISO):
> ArabicShaping.txt.
> 
> Why do something entirely different for the "console". Why not adapt
> XKB so that it, and its data files, can work for the "console" too?
> (Likewise for an input method mechanism (XIM??).)

It seems to me that something like the OS X XML-based file format would be
a lot nicer to work with than the legacy X formats.  The reason I suggested
classes is this: wouldn't it be nice if there were a *clean*, *unified* way to specify
both keyboard layouts *and* input methods for both *console* and *X*?  It's
just part of my "dream scenario" for what Linux ought to be like.  Thinking of
the problem in an object-oriented framework helped me break it down, and
hence, classes.
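
Purely as an illustration of the class idea (not a proposal for an
actual file format or API), the hierarchy could be sketched like this
in Python.  Everything here is hypothetical: the class and method
names, the layout dictionary, and in particular the use of canonical
NFD/NFC normalization as a stand-in for "decomposed" versus
"precomposed" output.  That stand-in covers Latin-style accents; it
would not produce Arabic presentation forms or touch Thai SARA AM,
which have only compatibility decompositions.

```python
import unicodedata
from abc import ABC, abstractmethod


class Keyboard(ABC):
    """Hypothetical base class: maps key names to the text they produce."""

    def __init__(self, layout):
        self.layout = dict(layout)

    @abstractmethod
    def encode(self, text):
        """Choose the code point sequence to emit for a key's text."""

    def press(self, key):
        return self.encode(self.layout[key])


class DecomposedKeyboard(Keyboard):
    def encode(self, text):
        # Emit base letter + combining marks (canonical decomposition).
        return unicodedata.normalize("NFD", text)


class PrecomposedKeyboard(Keyboard):
    def encode(self, text):
        # Emit the single precomposed code point where one exists.
        return unicodedata.normalize("NFC", text)


layout = {"e_acute": "\u00e9"}  # one sample key, purely illustrative

decomposed_kb = DecomposedKeyboard(layout)
precomposed_kb = PrecomposedKeyboard(layout)

print([hex(ord(c)) for c in decomposed_kb.press("e_acute")])
print([hex(ord(c)) for c in precomposed_kb.press("e_acute")])
```

The same layout data serves both subclasses; only the choice of
output form differs, which is the point of deriving them from a
common base.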

For what it is worth, my "dream" scenario looks like this: I can rotate through
any number of keyboard layouts and input methods using some hot key combination.
For the sake of example, let's just say I've defined "SCROLL LOCK" as my input method
switching hot key.  So, I turn on my machine, and from the console I can press
SCROLL LOCK to switch from my ASCII keyboard to Thai, press again
to switch to Arabic, press again to switch to the Chinese ZiRanMa input method,
press again to switch to the Chinese intelligent PinYin method, press again to ...
You get the idea.  The important thing to notice is that I don't do anything
different to switch to a complicated CJ(V)K input method than I do to switch
from, say, an English to a German keyboard, or from a German to a Hebrew one.
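
The switching behaviour itself is simple to sketch.  The class below
is a hypothetical illustration in Python, not an existing console or
X interface: it treats simple layouts and complex input methods as
entries in one list and advances through them on each hot key press.

```python
class InputMethodSwitcher:
    """Hypothetical sketch: one rotation for layouts and input methods alike."""

    def __init__(self, methods):
        if not methods:
            raise ValueError("at least one layout or input method is required")
        self._methods = list(methods)
        self._index = 0

    @property
    def current(self):
        return self._methods[self._index]

    def on_hotkey(self):
        # Advance to the next entry, wrapping around to the first at the end.
        self._index = (self._index + 1) % len(self._methods)
        return self.current


switcher = InputMethodSwitcher(
    ["ASCII", "Thai", "Arabic", "Chinese ZiRanMa", "Chinese PinYin"]
)

print(switcher.current)                              # starting layout
print([switcher.on_hotkey() for _ in range(5)])      # five presses, wraps around
```

Nothing in the rotation knows whether an entry is a plain layout or a
full input method, which is the property I care about.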

Then I boot into X, and because X now uses the same keyboard
and input method definition files as the console, it works just the same.

Of course, most users are never going to need more than just two keyboard layouts:
for example, one vanilla QWERTY US English layout, and one for their own language,
whatever it is.  But then there are other users, like myself, who in reality do
want English, Thai, Chinese, ... If you design the system to meet the needs of those
users, it will work really well for everyone else too.

- Ed Trager

> 
>               /kent k
> 
> --
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/linux-utf8/
> 