Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

rajeev joseph sebastian Sun, 05 Nov 2006 13:01:21 -0800

Sorry, Yahoo only allows me to top-post, since it doesn't properly quote the 
previous message. But I have tried to place my replies appropriately.

----- Original Message ----
From: Rich Felker <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, October 31, 2006 10:32:29 PM
Subject: Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

On Tue, Oct 31, 2006 at 09:37:34AM -0800, rajeev joseph sebastian wrote:
> Hi Rich Felker,
> 
> I find your work to provide support for Indic text on
> console/terminal to be admirable, and yes, any kind of display is
> far better than none at all (and I do not consider your statement
> insulting) :)
> 
> What I was referring to was a comment along the lines of "... have a
> set of wcwidth classes (say, 1, 2, and 3) and assign - glyphs - to
> one of those classes... ". (Please forgive me if I misunderstood the
> last few posts.) The word to note is "glyph". What I'm saying is you
> cannot in advance specify the width of any given conjunct. It may be
> different in different fonts.

Yes, my use of the word character rather than glyph was intentional
however. I know that the typographically correct way to do spacing
would be to measure the width of glyphs, but for better or worse the
only standardized api (wcwidth) works in terms of characters, and
terminals work in terms of characters. Sometimes this has benefits;
for example it makes it so you can highlight text that was printed to
the terminal and paste it into other apps or back into the terminal,
with exact results which are suitable for filenames and such. This
might not be possible if the app running in the terminal had converted
the text to a glyph representation. So in a way it's nice that the
character->glyph conversion is done at the last step, in the terminal,
since it keeps the data in the logical representation instead of the
presentation form. Of course it also has downsides too as I'm sure
we're all aware.

----------
Well, most correctly implemented Unicode-aware applications do this also:
have 2 backing stores, one for text and the other for glyphs. Use the glyph 
representation for display. When a selection is done, the map between the 2 
stores is used to derive the correct text for the selected glyphs.

CTL (complex text layout) script implementations have the concept of a logical 
cluster, which is used for this purpose. Basically, text is divided into 
logical clusters (each generally mapping to one or more glyphs), which allows 
text to be selected correctly, both programmatically and visually by the user.
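
A minimal sketch of the two-store idea, in Python. It assumes a toy 
clustering rule (a combining mark joins the preceding base character), which 
is enough for Latin but NOT for Indic conjuncts; the function names are made 
up for illustration and are not from any library.

```python
import unicodedata

def logical_clusters(text):
    """Return (start, end) code point ranges, one per logical cluster.

    Toy rule (an assumption): a character with a non-zero canonical
    combining class is attached to the cluster before it.
    """
    clusters = []
    for i, ch in enumerate(text):
        if clusters and unicodedata.combining(ch) != 0:
            start, _ = clusters[-1]
            clusters[-1] = (start, i + 1)
        else:
            clusters.append((i, i + 1))
    return clusters

def text_for_selection(text, clusters, first, last):
    """Map a selection of clusters first..last (inclusive) back to the text store."""
    return text[clusters[first][0]:clusters[last][1]]

text = "cafe\u0301s"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
clusters = logical_clusters(text)
selected = text_for_selection(text, clusters, 3, 3)  # the accented 'e' on screen
```

Selecting the fourth cluster returns both code points ('e' and the combining 
accent), so the text store stays intact however the display draws the cluster.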

This is also useful in the case of Latin text! 

Currently, most apps I have seen use the precomposed Latin characters, which is 
allowed only because of the stability policy. Most apps do not implement 
complex layout of Latin glyphs, which causes no end of problems for Latin 
transliterations of Indic/other text. Although most of the required characters 
for Indic transliteration are already available precomposed, the policy of 
Unicode and the combining mark model do not allow the rest to be encoded. Hence 
the proliferation of PUA codepoints for this purpose. (I hope the situation 
changes for GNU/Linux, but I think it is unlikely).
----------

The other issue here is that there's no standard for glyph numbering,
and Unicode doesn't represent glyphs, so there's really no way an
application running on a terminal could directly print glyphs. Even if
it could, just "cat file_with_indic_text.txt" on the terminal, or
something simple like "ls", would probably not work as expected.

------------
There is no need for glyph numbers, and that is one of the strong points of 
Unicode. I would strongly suggest looking over the HarfBuzz library, which is 
slowly evolving and will allow you to use the work of the best minds in the 
community. It will transform codepoints into glyphs, which you can then use. 
(You can also use Pango if need be).
------------

My hope is to work out a set of width assignments for characters so
that reasonable glyph presentations of the character sequence always
fit in the spacing provided by the sum of the "character widths".
Unfortunately this may result in excess spacing in some (many?) cases,
but I hope it can be made usable if not elegant. My (naive)
understanding is that Kannada conjuncts take place mostly as a
"subscript" to the bottom-right of the initial consonant and vowel
mark, so perhaps they'll look fairly proper in such a scheme.

-------
This is not always true. For Kannada, I will try to confirm that. For 
Malayalam, it is most certainly not true. In fact, for Malayalam, you cannot 
even be sure at any point whether a particular sequence of characters maps to 
one glyph or to more than one; the set of conjuncts differs from font to 
font, so the very same sequence of characters may map to a single glyph in 
one font, multiple glyphs in another, and a different number of glyphs in a 
third font.
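
To illustrate, here is a toy sketch in Python. The two "fonts" are 
hypothetical conjunct tables that I made up, and the greedy substitution is 
only a stand-in: it is not how HarfBuzz or any real shaper works.

```python
def shape(text, conjuncts):
    """Greedy longest-match substitution; returns a list of glyph names.

    `conjuncts` maps a code point sequence to a single (hypothetical) glyph
    name. Any character without a matching conjunct gets its own glyph.
    """
    glyphs = []
    longest = max((len(seq) for seq in conjuncts), default=1)
    i = 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in conjuncts:
                glyphs.append(conjuncts[chunk])
                i += n
                break
        else:
            # No conjunct matched at this position: one glyph per character.
            glyphs.append("u%04X" % ord(text[i]))
            i += 1
    return glyphs

KKA = "\u0D15\u0D4D\u0D15"        # KA + VIRAMA + KA
font_a = {KKA: "kka_conjunct"}    # hypothetical font with a kka conjunct
font_b = {}                       # hypothetical font without one

glyphs_a = shape(KKA, font_a)
glyphs_b = shape(KKA, font_b)
```

The same three code points come out as one glyph with the first table and 
three with the second, so a width fixed per character cannot know which case 
it is in.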
-------

> I suppose, we need to develop console specific fonts which can make
> proper use of the available width classes (or the structure you
> propose), however, I don't think any research has occurred in this
> regard.

Well, as long as a reasonable font size were chosen, any font that
fits into the (possibly excessive) width allocation could be used in
principle. For uuterm I'm working on 8x16-cell (and later other larger
sizes) bitmap fonts, which I find much more usable, but there's no
reason other terminal emulators like mlterm couldn't use truetype
fonts in this framework.

> So, a proper answer to your question: how many width classes, really
> needs a lot of work both artistic as well as technical. (Malayalam
> has about 950 conjuncts, so it has to be seen how they can fit into
> those classes).

Well my question is much simpler I think: given a character, what's
the "most space" it can take up in any conjunct it forms?

------------
If you mean that each logical cluster will be allocated a width equal to the 
sum of the widths of the characters in that cluster, then I think you will 
allocate far too much space :)
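
A rough sketch of that allocation, in Python. The width function is my own 
approximation of a wcwidth()-style rule built from the Unicode database (0 
for non-spacing marks, 2 for wide East Asian characters, 1 otherwise); it is 
an assumption and not the table of any particular libc or terminal.

```python
import unicodedata

def approx_wcwidth(ch):
    """Approximate per-character cell width (an assumption, see above)."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def allocated_cells(cluster):
    """Cells reserved for a cluster when per-character widths are summed."""
    return sum(approx_wcwidth(ch) for ch in cluster)

# NA + VIRAMA + TA + VIRAMA + RA: three consonants joined by two viramas
ntra = "\u0D28\u0D4D\u0D24\u0D4D\u0D30"
cells = allocated_cells(ntra)
```

Even with a width of 1 per consonant, this sequence reserves three cells, 
and the reservation grows with every consonant added, whatever glyph the 
font actually draws. Wider per-character classes would only increase it.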
------------

> Speaking of curses, doesn't Debian/(K)ubuntu use curses for its
> installer? I remember telling the Kubuntu devels to remove Hindi
> from the list of languages, because the rendering looks
> really horrible (misplaced vowels, and so many other things,
> unrelated to spacing/width).

Yes.. it's not really a curses problem though. As long as the terminal
supports reordering and ligatures, using curses should not be much of
a problem. I still need to write the reordering stuff for uuterm
though.

----------
I strongly suggest looking over the HarfBuzz library. Could you post a link 
to the uuterm development website?
----------

Regards,
Rajeev J Sebastian

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
