Re: questions with combining characters [was: Unicode: endpoint of evolution of encodings?]

Christopher Fynn Wed, 17 Nov 2004 17:45:01 -0800

Edward H. Trager wrote:

Mlterm (http://mlterm.sourceforge.net/) is a multilingual-capable terminal
emulator which handles combining characters.  Mlterm with a console-based
mail reader like mutt works pretty well.  However, one is still at the
mercy of the fonts.  Even an OpenType font which handles diacritic stacking
may still not place diacritics properly for Vietnamese unless that font
was really designed with Vietnamese in mind.  And, supposing you do find a
font with very nice typographic placement of diacritics for Vietnamese, that
same font might not work so well for Greek, for example.  So, the current
situation is that in practice you get more readable results when your Unicode
text actually uses the code points for the precomposed characters.

This seems to be correct for HTML & XML at least, since W3C's (draft) "Character Model for the World Wide Web 1.0: Normalization" specifies NFC for HTML & XML: <http://www.w3.org/TR/charmod-norm/>. I don't know whether any particular form is specified for other protocols.
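
As a rough illustration of what NFC does to decomposed text, here is a minimal sketch using Python's standard unicodedata module. The helper name to_nfc and the sample strings are only assumptions made for this example, not anything taken from the W3C draft.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)

# "a" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed_a = "a\u0301"
# Vietnamese example: "e" + U+0302 (circumflex) + U+0301 (acute)
decomposed_e = "e\u0302\u0301"

composed_a = to_nfc(decomposed_a)   # single code point U+00E1
composed_e = to_nfc(decomposed_e)   # single code point U+1EBF

print([hex(ord(c)) for c in composed_a])
print([hex(ord(c)) for c in composed_e])
```

With precomposed code points, a font only needs a glyph for that one character, and does not depend on mark positioning being done well.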

Hi,

Somewhat related to this, I have some questions about combining characters.
It's clear to me how they should be handled if the complete text to be
displayed is known in advance. But I don't know what has to be done if one
tries to display a real-time text flow.

Just think of a talk/ytalk enhancement working with UTF-8 encoding and NFD
representation. And network lags...

Maybe I type an "á", first "a" is sent over the network, then for some
reason some packets are lost or there's a short network failure, and the
combining acute is only sent five seconds later. The receiving party has to
first display an "a" since it doesn't know it's going to be continued. Then
later it has to be able to put an accent over the already displayed
character.
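
The scenario above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class StreamDisplay and its fields are hypothetical names, the "display" is just a list of strings, and unicodedata.combining() is used as a simplified test for combining marks (it reports the canonical combining class, so marks with class 0 would not be caught).

```python
import codecs
import unicodedata

class StreamDisplay:
    """Hypothetical receiver that accepts UTF-8 bytes as they arrive."""

    def __init__(self):
        # The incremental decoder buffers incomplete UTF-8 byte sequences.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self.clusters = []   # each entry: a base character plus its marks
        self.redraws = 0     # how often an already shown cell was updated

    def feed(self, data: bytes) -> None:
        for ch in self._decoder.decode(data):
            if unicodedata.combining(ch) and self.clusters:
                # Late combining mark: attach it to the last shown character
                # and count this as a redraw of that cell.
                self.clusters[-1] += ch
                self.redraws += 1
            else:
                self.clusters.append(ch)

    def text(self) -> str:
        # Compose for display, so "a" + U+0301 is shown as U+00E1.
        return unicodedata.normalize("NFC", "".join(self.clusters))

display = StreamDisplay()
display.feed(b"a")          # the base letter arrives first
early = display.text()      # shown as plain "a"
display.feed(b"\xcc\x81")   # U+0301 arrives later, after the network lag
late = display.text()       # the same cell is now "a" with acute
```

The point of the sketch is that the receiver cannot know a cluster is complete until the next non-combining character arrives, so it has to be able to go back and update the last cell.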

What's the design rationale of the combining character following instead of
preceding the letter itself? Just think of TeX's \'a, or the combining
character feature of the Linux console and the X Window System: there the
accent is always entered first, which makes it much easier to handle these
input streams.

Maybe the rationale was based on the simple fact that when you write on a piece of paper, you write the "a" first, then the accent or other diacritical marks ...


Yes - something like that.

TUS IV Ch. 2.2 Design Principles: Logical Order: "Unicode text is stored in logical order in the memory representation, roughly corresponding to the order in which text is typed in via the keyboard."

also W3C's (draft) "Character Model for the World Wide Web 1.0: Fundamentals" states: "Protocols, data formats and APIs MUST store, interchange or process text data in logical order."
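
To make the stored order concrete, here is a small sketch (the helper name describe is an assumption for this example) that lists the code points of a decomposed string in memory order. The base letter comes first and the combining mark follows it.

```python
import unicodedata

def describe(text: str) -> list:
    """Return the Unicode character names of text, in stored order."""
    return [unicodedata.name(ch) for ch in text]

# Decompose U+00E1 (a with acute) into base letter plus combining mark.
stored = unicodedata.normalize("NFD", "\u00e1")
names = describe(stored)
for name in names:
    print(name)
```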

- Chris


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
