My new roommate just spent three years in Japan and knows quite
a bit about Japanese computing, the Japanese language, and Linux. I told
him about this post and he gave me a little half-hour lecture about
Japanese keyboard input methods. So....
On Sat, 22 Jan 2000, Andreas Beck wrote:
> > > However there will probably be a problem with eastern languages that utilize
> > > popupmenus to select a symbol from several that have the same keyboard
> > > representation.
That, complex though it is, is only one of four possible input
modes on a Japanese keyboard, which is a standard PC keyboard with a
couple of new modifier keys for use in selecting the desired input mode in
software. Here's how it all breaks down, to the best of my new and
admittedly secondhand understanding:
* Method 0: Romaji. Standard Roman character input.
* Method 1: Key-per-Hiragana-character. Hiragana is a native Japanese
phonetic alphabet, with each key/modifier mapped in a similar manner to
roman alphabets. Fairly easy to handle.
* Method 2: Key-per-Katakana-character. Another(!) native Japanese
phonetic alphabet, used for foreign words. It just needs another
key/modifier lookup table. Also fairly easy to handle.
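The per-mode tables for Methods 1 and 2 might be sketched roughly like
this (the key assignments below are purely illustrative, not the real
JIS kana layout):

```python
# Sketch of per-mode key tables: the same physical key yields a
# hiragana or katakana character depending on the active input mode.
# Key assignments are illustrative, not the actual JIS kana layout.
KANA_TABLES = {
    "hiragana": {"q": "た", "w": "て"},
    "katakana": {"q": "タ", "w": "テ"},
}

def translate(mode, key):
    """Look the key up in the table for the current mode;
    pass unmapped keys through unchanged."""
    return KANA_TABLES[mode].get(key, key)
```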
* Method 3: Inline contextualized phoneme-set-to-Kanji grammar-based
lookup (!!!). This is the hard one. This is what Andy was referring to
below, and it is _weird_. Essentially it works as follows:
1: You use the Hiragana or Katakana(?) alphabet to enter a string of
phonemes which comprise a _spoken_ Kanji syllable. Each syllabic Kanji
can be composed of 1, 2, 3 or sometimes 4 phonemes. You then need a way
to tell the software side that you are finished entering your Kanji
syllable, which on Japanese Windows is done by hitting the spacebar. So:
Hiragana 'Z' phoneme +
Hiragana 'e' phoneme +
Hiragana 'n' phoneme =
'Zen' Kanji syllabic glyph.
I *think* that this stage can be handled with a simple keypress
accumulator and table lookup.
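A toy version of that accumulator, assuming a simple buffered match
against a (tiny, hypothetical) phoneme table:

```python
# Toy keypress accumulator: buffer keystrokes and emit a kana
# character whenever the buffer matches the phoneme table. The table
# is a small illustrative subset of a real romaji map; a real input
# method also has to handle ambiguity (e.g. 'n' alone vs. 'na').
PHONEME_TABLE = {
    "ka": "か", "ze": "ぜ", "n": "ん", "hi": "ひ", "ra": "ら",
}

def accumulate(keystrokes):
    out, buf = [], ""
    for key in keystrokes:
        buf += key
        if buf in PHONEME_TABLE:
            out.append(PHONEME_TABLE[buf])
            buf = ""
    return "".join(out), buf  # (converted kana, pending keystrokes)
```

So `accumulate("zen")` buffers 'z', matches 'ze', then matches 'n',
yielding the two-kana string for "zen" with nothing left pending.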
2: Now that we have a way to turn phonemic Katakana or Hiragana into Kanji
syllables, we next need a way to turn the Kanji syllables into higher
level conceptually-mapped Kanji glyphs or glyph strings ("words"). As
with step #1, this is done using an inlined lookup engine. So:
Kanji 'Hi' syllable +
Kanji 'ra' syllable +
Kanji 'ga' syllable +
Kanji 'na' syllable =
'Hiragana' Kanji (one or more glyphs).
NOTE: This example is contrived. I have no idea what the correct
Kanji for 'Hiragana' actually is.
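This reading-to-word step can be pictured as a dictionary from kana
readings to candidate kanji spellings. The 'zen' entry below lists a
few real homophones (全, 善, 禅, 前); everything else about the table
is invented for illustration:

```python
# Hypothetical reading-to-candidates dictionary. One kana reading can
# map to many kanji "spellings"; the input method offers them all and
# the user (or the grammar engine) picks one.
READING_TO_WORDS = {
    "ぜん": ["全", "善", "禅", "前"],  # a few of the many zen homophones
    "ひらがな": ["ひらがな"],          # placeholder entry for the example above
}

def candidates(reading):
    """Return every candidate word for a completed kana reading."""
    return READING_TO_WORDS.get(reading, [])
```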
As you can see, both steps are intertwined: while the user has
entered one or more syllabic Kanji but has not yet signaled the end of a
"word", the meaning of the word, and thus the final resultant Kanji, are
unknown. In Windows, what you see is that the current in-progress Kanji
word is highlighted, and after each syllabic Kanji is entered, a "hint box"
pops up containing all the contextually legal words which begin with the
already-entered Kanji syllables. Rather like Netscape's URL
autocompletion.
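The hint-box behavior reduces to a prefix query over the known
readings, something like the following (the word list is again
illustrative):

```python
# Prefix query mimicking the pop-up hint box: return every known
# reading that starts with the kana entered so far. A real input
# method would also filter the list by grammatical context.
READINGS = ["ぜん", "ぜんぶ", "ぜんいん", "ひらがな"]

def hint_box(readings, prefix):
    return sorted(r for r in readings if r.startswith(prefix))
```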
3: Now that we have the final Kanji, we need to display it. I think all
Kanji are represented in Unicode, so this should be a simple font table
lookup.
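That last step might look like this; the basic CJK Unified Ideographs
block really does sit at U+4E00 through U+9FFF in Unicode (extension
blocks omitted), though here the code point itself just stands in for
a real font table index:

```python
# Sketch of the display step: map each character of the final kanji
# string to a glyph index. The Unicode code point stands in for a
# real font's glyph table index.
def is_kanji(ch):
    # Basic CJK Unified Ideographs block; extension blocks omitted.
    return 0x4E00 <= ord(ch) <= 0x9FFF

def glyph_indices(text):
    return [ord(ch) for ch in text]
```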
Nice, eh? The unfortunate truth is that this crazy input system
is pretty much required, due to the highly contextualized nature of the
Japanese language. The Kanji for 'Zen' (for example) can have over 20
completely different meanings when used in different grammatical contexts.
Unless you keep track of the running context, it is impossible to
accurately translate subsequently entered Kanji syllables into Kanji
words. This more or less requires that a full Japanese grammar engine be
embedded in the input protocol itself :-/.
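One crude way to picture that grammar dependence is ranking each
candidate by how plausible it is after the preceding word. The bigram
counts below are entirely invented; a real engine would embed a full
grammar or language model, not a two-entry table:

```python
# Toy context-sensitive ranking: score each kanji candidate by an
# invented bigram count keyed on the preceding word, so the same
# reading resolves differently in different contexts.
BIGRAM_COUNTS = {
    ("お", "前"): 5,  # invented: 前 scores highly after お here
    ("", "全"): 2,    # invented: 全 preferred at start of a phrase
}

def rank(prev_word, candidates):
    """Order candidates by descending context score (ties keep order)."""
    return sorted(candidates,
                  key=lambda c: -BIGRAM_COUNTS.get((prev_word, c), 0))
```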
Luckily, although this is all quite complex, I do not think it
impossible. One or more LibGII translation modules will need to sit in
the input stream and perform the various translation steps, while also
sending events back and forth to the higher-level LibGGI code which
handles the display updating, highlighting, autocompletion, etc. And I
would be surprised if there were not already some open source
Japanese grammar engine code out there, given how much has been done
already WRT Japanese locale support.
As far as Chinese and Korean are concerned, I really don't know
that much. I think that the Chinese and Japanese kanji are mostly the
same, but China does not have a phonemic written alphabet. And written
Vietnamese is all phonetic (Roman alphabet with a bunch of phonetic
modifications to the basic roman characters).
> > oh, that sounds ugly. So is the keyboard symbol interpreted as a glyph
> > and the user must then choose one of the unicodes corresponding to it ?
> I don't know from experience. The way people explained it to me, it is like
> entering the vocal expression by the keyboard and you will then get a list
> of all possible symbols that sound this way (but have different meaning of
> course ...) ...
> > I mean, if I write a text, I'm more concerned about the content than about
> > the form, so I would want my keys to correspond to characters rather
> > than glyphs. Anyway, we'll see how far we can get with this info.
> The problem seems to exist with the languages that have a huge number of
> characters/symbols. As you can't get them onto a reasonably sized keyboard,
> you use a reduced keyboard that can somehow express the vocal properties
> and a database lookup that will give all possible chars that sound this way.
> One probably has to see this live to fully understand how it works. Anyone
> here that works with such a system, who could give some first hand knowledge?
My knowledge is secondhand, but Mitch (my roommate) knows all of
this quite well, so I can quickly find out whatever I do not already know.
Let me know if I can be of further help to anyone here.
'Cloning and the reprogramming of DNA is the first serious step in
becoming one with God.'
- Scientist G. Richard Seed