Edward H. Trager wrote:
Mlterm (http://mlterm.sourceforge.net/) is a multilingual-capable terminal emulator which handles combining characters. Mlterm with a console-based mail reader like mutt works pretty well. However, one is still at the mercy of the fonts. Even an OpenType font which handles diacritic stacking may still not place diacritics properly for Vietnamese unless that font was really designed with Vietnamese in mind. And, supposing you do find a font with very nice typographic placement of diacritics for Vietnamese, that same font might not work so well for Greek, for example. So, the current situation is that in practice you get more readable results when your Unicode text actually uses the code points for the precomposed glyphs.
This seems to be correct for HTML and XML at least, since
W3C's (draft) "Character Model for the World Wide Web 1.0: Normalization" <http://www.w3.org/TR/charmod-norm/> specifies NFC for both. I don't know whether
any particular form is specified for other protocols.
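To make the precomposed-versus-combining distinction concrete, here is a minimal sketch using Python's standard unicodedata module. The helper name to_nfc is only an illustration and does not come from any of the software mentioned above.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text: str) -> str:
    """Return text in Normalization Form D (fully decomposed)."""
    return unicodedata.normalize("NFD", text)


# 'a' followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E1.
a_acute = to_nfc("a\u0301")

# Vietnamese 'e' with circumflex and dot below: the two combining marks
# may arrive in either order; both compose to the single code point U+1EC7.
e_viet_1 = to_nfc("e\u0323\u0302")
e_viet_2 = to_nfc("e\u0302\u0323")
```

After NFC the renderer receives a single code point per accented letter, so it can use the font's precomposed glyph instead of relying on the font to position the marks.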
Hi,
Loosely related to this, I have some questions about combining characters. It's clear to me how they should be handled if the complete text to be displayed is known in advance. But I don't know what has to be done if one tries to display a real-time text stream.
Just think of a talk/ytalk enhancement working with UTF-8 encoding and NFD representation. And network lags...
Maybe I type an "á": first the "a" is sent over the network, then for some reason some packets are lost or there's a short network failure, and the combining acute is only sent five seconds later. The receiving party first has to display an "a", since it doesn't know the character is going to be continued. Later it has to be able to put an accent over the already displayed character.
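One way a receiver could cope with this is sketched below. This is only an illustration under my own assumptions, not how talk or ytalk work: the class StreamingReceiver and its "append" / "redraw_last" events are hypothetical names, and the test for a combining mark (a non-zero canonical combining class) is a simplification that misses some marks.

```python
import codecs
import unicodedata


class StreamingReceiver:
    """Hypothetical receiver that tolerates late-arriving combining marks."""

    def __init__(self):
        # The incremental decoder buffers incomplete UTF-8 byte sequences.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        # One entry per displayed cell: a base character plus its marks.
        self.clusters = []

    def feed(self, data: bytes):
        """Decode a network chunk and return a list of display events.

        ("append", s)      : draw s as a new character
        ("redraw_last", s) : replace the last drawn character with s
        """
        events = []
        for ch in self._decoder.decode(data):
            # Simplification: treat a non-zero combining class as "mark".
            if unicodedata.combining(ch) and self.clusters:
                merged = unicodedata.normalize("NFC", self.clusters[-1] + ch)
                self.clusters[-1] = merged
                events.append(("redraw_last", merged))
            else:
                self.clusters.append(ch)
                events.append(("append", ch))
        return events


receiver = StreamingReceiver()
first = receiver.feed(b"a")        # the base letter arrives alone
partial = receiver.feed(b"\xcc")   # first byte of U+0301, nothing to show yet
late = receiver.feed(b"\x81")      # second byte arrives after the lag
```

The receiver shows the bare "a" immediately and later replaces that same cell with the accented letter, so the display has to be able to go back and revise the last character.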
What's the design rationale for the combining character following the letter instead of preceding it? Just think of TeX's \'a, or the combining character feature of the Linux console and X Window: there the accent is always entered first, which makes it much easier to handle these input streams.
Maybe the rationale was based on the simple fact that when you write on a
piece of paper, you write the "a" first, then the accent or other diacritical
marks ...
Yes - something like that.
TUS IV Ch. 2.2 Design Principles: Logical Order:
"Unicode text is stored in logical order in the memory representation, roughly corresponding to the order in which text is typed in via the keyboard."
Also, W3C's (draft) "Character Model for the World Wide Web 1.0: Fundamentals" states: "Protocols, data formats and APIs MUST store, interchange or process text data in logical order."
- Chris
-- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
