While working on my character cell font/terminal problem, I've been doing some research on Devanagari and other Indic scripts and the way they handle consonant clusters. Unlike Tibetan, which naturally fits Unicode's combining character semantics and POSIX's wcwidth(), Indic scripts are unfortunately very unfriendly to character cell devices, at least under the existing width interpretation, which I will call "WI1" for "Width Interpretation #1".

The obvious small problem is the left-combining "i" vowel mark, which (I was not previously aware of this) combines at the left of the whole cluster, not just the preceding character. This can be handled in WI1, but requires substitution rules which can span arbitrary numbers of cells, making updates awkward. Also, my understanding is that the superscript stroke of the "i" is supposed to span the whole syllable, making things trickier for these arbitrary-width clusters.

A more fundamental problem, which may even be seen as a cause of the former problem, is the combining/ligature nature of clusters. Many clusters such as "kka" or anything involving "ra" _should_ occupy fewer columns than the number of letters in the cluster. With 'dead' "ra" characters in the cluster it is particularly bad for them to occupy a column of their own, since they actually become combining marks on another glyph cell (which may be at an arbitrary distance from the "ra" character). Under WI1, each consonant has a wcwidth of 1. Thus, the only way to handle character-cell Indic scripts under WI1 is to have the dead "ra" turn into a blank space, which will look very odd.

What I'd like to propose is a new width interpretation for Indic scripts, "WI2". Under WI2:

- All independent vowels and consonants have a width of 2.
- The virama has a width of -2 and makes the previous character part of a combining stack with the character that follows.
- Each double-width character cell contains an entire consonant cluster, not just a single glyph. Much like Hangul Jamo.
- All dependent vowel marks are simple width-0 nonspacing characters which apply to the whole cluster. Also like Hangul Jamo. Note that this includes the left-combining "i" vowel, which just appears at the left of the character cell and whose superscript stroke is of fixed width.

According to casual examination of many Devanagari clusters, they appear to fit nicely into double-width cells, with complexity/density similar to CJK glyphs.
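
To make the WI2 arithmetic concrete, here is a minimal Python sketch of the rules listed above, restricted to Devanagari. The function names (wi2_width, wi2_string_width, wi2_cells) are made up for illustration, the code point ranges are my reading of the Unicode Devanagari block, and the fallback width of 1 for every other character is purely an assumption. It is not taken from any existing wcwidth() implementation.

```python
VIRAMA = 0x094D  # DEVANAGARI SIGN VIRAMA


def wi2_width(ch):
    """WI2 width of one character (hypothetical helper, Devanagari only)."""
    cp = ord(ch)
    if cp == VIRAMA:
        return -2  # cancels the cell opened by the preceding consonant
    if 0x0904 <= cp <= 0x0939:
        return 2   # independent vowels (0904-0914) and consonants (0915-0939)
    if 0x093E <= cp <= 0x094C:
        return 0   # dependent vowel marks, including the left-combining "i"
    return 1       # assumption: anything else is treated as single-width


def wi2_string_width(text):
    """Total number of columns the text occupies under WI2."""
    return sum(wi2_width(ch) for ch in text)


def wi2_cells(text):
    """Group characters so that each list entry is the content of one cell."""
    cells = []
    after_virama = False
    for ch in text:
        w = wi2_width(ch)
        if cells and (w <= 0 or (after_virama and w == 2)):
            cells[-1] += ch      # mark, virama, or consonant joining a stack
        else:
            cells.append(ch)     # starts a new cell
        after_virama = ord(ch) == VIRAMA
    return cells


if __name__ == "__main__":
    kka = "\u0915\u094D\u0915"                        # ka + virama + ka
    namaste = "\u0928\u092E\u0938\u094D\u0924\u0947"  # na, ma, sa+virama+ta+e
    print(wi2_string_width(kka))       # 2 - 2 + 2 = 2
    print(wi2_string_width(namaste))   # 2 + 2 + (2 - 2 + 2 + 0) = 6
    print(len(wi2_cells(namaste)))     # 3 cells of 2 columns each
```

One thing the plain sum makes visible: a consonant followed by a virama with nothing after it comes out at width 0. The rules above do not say how that case should be displayed, so the sketch makes no attempt to handle it.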

Simple fonts without complex ligatures could just squeeze the individual nominal glyphs into the cell and render them with overstrike (using some simple context rules for the positioning), while nice fonts would use either dedicated ligature glyphs or a mixture of dedicated ligatures and overstriking glyphs that together produce the correct ligatures.

Please note that all of the above applies only to character cell displays and wcwidth. Naturally it should be ignored, and existing systems used, for elegant Indic script layout in variable-width fonts, but I believe that WI2 (or a variant of it) provides much more reasonable, workable Indic script support than WI1.

If anyone could provide comments on the following issues, I would much appreciate it:

1. Does any existing character cell application (terminal emulator) both display correctly-rendered Indic text and conform to WI1, i.e. does it update column position according to wcwidth() and not the OpenType-rendered width of the text string? I suspect not. RTFS'ing mlterm, it seems like it does not. I can't find any good info on ncst-term.
2. Are there serious limitations of WI2 that make it impossible to display [legibly] certain consonant clusters? Can the ZWJ/ZWNJ semantics be satisfied correctly?
3. Other comments?

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
