Re: i18n of abiword -- combining characters

Paul Rohr Fri, 14 Jan 2000 15:08:50 -0600 (CST)
Thai, like some other languages, allows a sequence of individual characters 
to be typed to form a single glyph.  Usually this takes the form of a base 
character which is further modified by other combining characters.  

1.  Character sequence normalization.  (reasonable)
---------------------------------------------------
Thus, work needs to be done (probably at input time) to normalize 
those sequences of combining characters, and perhaps to ignore invalid ones.  
(Otherwise, the variant sequences will make features like spell-check 
prohibitively unreliable.)

The Thai-specific algorithms here seem to be well-defined, although I'm not 
sure whether these get applied before or after step 1.  
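
As a rough illustration of what normalizing at input time buys us, here is 
a minimal sketch.  It uses Unicode canonical normalization (NFC) from 
Python's standard unicodedata module, which is NOT the Thai-specific 
algorithm mentioned above, just a generic stand-in.  The function name 
normalize_input is made up for this example.

```python
import unicodedata


def normalize_input(text: str) -> str:
    """Return the NFC form of text.

    NFC also sorts adjacent combining marks into canonical order, so
    different typing orders of the same marks compare equal afterwards.
    """
    return unicodedata.normalize("NFC", text)


# KO KAI followed by MAI EK (tone mark) and SARA U (vowel below),
# typed in two different orders.
typed_tone_first = "\u0e01\u0e48\u0e38"
typed_vowel_first = "\u0e01\u0e38\u0e48"

# The raw sequences differ, which is what would confuse spell-check.
assert typed_tone_first != typed_vowel_first

# After normalization both spellings are the same character sequence.
assert normalize_input(typed_tone_first) == normalize_input(typed_vowel_first)
```

Note that NFC never rejects a sequence, so ignoring invalid sequences 
would still have to be a separate, language-specific step.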

2.  Combining characters -- position.  (???)
--------------------------------------------
The current code assumes that every Unicode character will occupy one cell 
of display space of a known width.  However, languages like Thai render 
sequences of several characters into the same display cell.  For a WYSIWYG 
word processor, changing this fundamental assumption has a variety of 
implications.  

2a.  selection semantics -- When you select one glyph, you select one or 
more characters.  All edit operations that currently affect one "character" 
will need to be reexamined to see whether they should affect the glyph, or 
one of the component characters in the glyph.  For example, does backspace 
delete all characters in the glyph, or does it remove the last combining 
character, changing the glyph but maintaining the cursor position?  

2b.  cursor semantics -- Similarly, moving the cursor one glyph to the right 
means moving one or more characters to the right.  
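
To make the choices in 2a and 2b concrete, here is a small sketch.  It 
assumes that a glyph is one base character plus any following characters in 
a Unicode Mark category, which is a simplification of real cluster rules.  
All function names are hypothetical, and positions are character offsets 
into a plain string rather than anything from the actual piece table.

```python
import unicodedata
from typing import List, Tuple


def is_combining(ch: str) -> bool:
    # Assumption: Mark categories (Mn, Mc, Me) attach to the previous base.
    return unicodedata.category(ch).startswith("M")


def glyph_clusters(text: str) -> List[str]:
    """Group characters into glyphs: a base plus its combining marks."""
    clusters: List[str] = []
    for ch in text:
        if clusters and is_combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def cursor_right(text: str, pos: int) -> int:
    """Move the cursor one glyph right, skipping over combining marks."""
    pos = min(pos + 1, len(text))
    while pos < len(text) and is_combining(text[pos]):
        pos += 1
    return pos


def backspace_glyph(text: str, pos: int) -> Tuple[str, int]:
    """Option 1: delete every character of the glyph before the cursor."""
    start = pos
    while start > 0:
        start -= 1
        if not is_combining(text[start]):
            break
    return text[:start] + text[pos:], start


def backspace_char(text: str, pos: int) -> Tuple[str, int]:
    """Option 2: delete only the last character before the cursor.

    If that character was a combining mark, the glyph changes but the
    cursor stays after the same glyph; only the offset shrinks by one.
    """
    if pos == 0:
        return text, 0
    return text[:pos - 1] + text[pos:], pos - 1


# KO KAI + SARA I + MAI EK form one glyph; KHO KHAI is a second glyph.
sample = "\u0e01\u0e34\u0e48\u0e02"
assert glyph_clusters(sample) == ["\u0e01\u0e34\u0e48", "\u0e02"]
assert cursor_right(sample, 0) == 3
```

With the cursor after the first glyph (offset 3), backspace_glyph removes 
all three characters, while backspace_char removes only the tone mark.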

3.  Combining characters -- width.  (???)
-----------------------------------------
More fundamentally, our charwidth-handling logic will need to be expanded to 
handle combining characters.  In most cases, the width of the first 
character in the sequence determines the width of the entire cell.  However, 
some combinations make the entire cell wider.  

Currently, the formatter maintains a per-character array of widths.  For 
combining characters, we can't just add those widths to the total width of 
the word.  Instead, they'll somehow need to be folded into the width 
calculation for the resulting glyph.  The exact algorithm needed here depends 
on how information about the "width" of combining characters is stored in 
the font.  

I'm totally just guessing now, but two simple approaches come to mind:

3a.  Sum.  If we assume that most combining characters don't affect the 
glyph width, we could store their charwidth as zero and then calculate the 
glyph width as the sum of the charwidths of all its characters.  
(Presumably, then, if some combining characters do affect the overall glyph 
width, we'd have to store the difference.)

3b.  Max.  Alternatively, if the cell width of each combining character 
indicates the width of the resulting cell when this combining character is 
used, then we might instead be able to just calculate the glyph width as the 
maximum of its constituent charwidths.  
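
The two guesses above can be compared side by side in a short sketch.  The 
widths are invented numbers, not measurements from any font, and the 
function names are hypothetical.  Each glyph is given as the list of 
charwidths of its characters, base character first.

```python
from typing import Callable, List, Sequence


def glyph_width_sum(charwidths: Sequence[int]) -> int:
    """3a: marks store zero, or the extra width they add to the cell."""
    return sum(charwidths)


def glyph_width_max(charwidths: Sequence[int]) -> int:
    """3b: each mark stores the full cell width that results from it."""
    return max(charwidths, default=0)


def word_width(glyphs: Sequence[Sequence[int]],
               glyph_width: Callable[[Sequence[int]], int]) -> int:
    """Total the word width glyph by glyph, not character by character."""
    return sum(glyph_width(g) for g in glyphs)


# Invented example: a 10-unit base, one mark that leaves the cell alone,
# one mark that widens the cell to 12 units, then a plain 8-unit glyph.
word_as_deltas: List[List[int]] = [[10, 0, 2], [8]]       # storage for 3a
word_as_cells: List[List[int]] = [[10, 10, 12], [8]]      # storage for 3b

assert word_width(word_as_deltas, glyph_width_sum) == 20
assert word_width(word_as_cells, glyph_width_max) == 20
```

Both schemes give the same answer here; the difference is only in what the 
per-character width array would have to contain for combining characters.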

However, I really don't know what I'm talking about here.  More 
investigation is definitely needed, depending on the capabilities of each 
platform's GR_Graphics::measureString() implementation. 

4.  Combining characters -- rendering.  (???, platform-specific)  
----------------------------------------------------------------
On each platform, someone will need to investigate whether the 
text-rendering primitives know how to properly combine a character sequence 
into a single glyph.  If so, drawing should be pretty easy.  If not, adding 
logic to do all that rendering from the constituent glyphs in the font may 
be difficult.  

Paul