Re: Proposed fix for Malayalam (& other Indic?) chars and wcwidth

Rich Felker Tue, 31 Oct 2006 12:09:36 -0800

On Tue, Oct 31, 2006 at 09:37:34AM -0800, rajeev joseph sebastian wrote:
> Hi Rich Felker,
> 
> I find your work to provide support for Indic text on
> console/terminal to be admirable, and yes, any kind of display is
> far better than none at all (and I do not consider your statement
> insulting) :)
> 
> What I was referring to was a comment along the lines of "... have a
> set of wcwidth classes (say, 1, 2, and 3) and assign - glyphs - to
> one of those classes... ". (Please forgive me if I misunderstood the
> last few posts.) The word to note is "glyph". What I'm saying is you
> cannot in advance specify the width of any given conjunct. It may be
> different in different fonts.


Yes, but my use of the word "character" rather than "glyph" was
intentional. I know that the typographically correct way to do spacing
would be to measure the width of glyphs, but for better or worse the
only standardized api (wcwidth) works in terms of characters, and
terminals work in terms of characters. Sometimes this has benefits;
for example it makes it so you can highlight text that was printed to
the terminal and paste it into other apps or back into the terminal,
with exact results which are suitable for filenames and such. This
might not be possible if the app running in the terminal had converted
the text to a glyph representation. So in a way it's nice that the
character->glyph conversion is done at the last step, in the terminal,
since it keeps the data in the logical representation instead of the
presentation form. Of course it also has downsides, as I'm sure
we're all aware.
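
To make the "widths are per character" point concrete, here is a minimal sketch in Python. It is not wcwidth() itself: approx_char_width is a hypothetical stand-in built on the standard unicodedata module, and it assumes that nonspacing marks, enclosing marks and format characters take zero columns, East Asian wide characters take two, and everything else (including control characters, which the real wcwidth rejects) takes one.

```python
import unicodedata


def approx_char_width(ch: str) -> int:
    """Rough, hypothetical stand-in for wcwidth(): columns for one character."""
    # Assumption: nonspacing marks, enclosing marks and format characters
    # occupy no column of their own.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters occupy two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def approx_string_width(text: str) -> int:
    """Width of a string as the plain sum of its per-character widths."""
    return sum(approx_char_width(ch) for ch in text)


# Malayalam KA + VIRAMA + KA: a single conjunct on screen, but three
# characters in the data, and only the characters are visible to this API.
kka = "\u0d15\u0d4d\u0d15"
print(approx_string_width(kka))
```

The width the terminal is told about depends only on the character sequence, never on how wide the conjunct glyph turns out to be in a given font.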

The other issue here is that there's no standard for glyph numbering,
and Unicode doesn't represent glyphs, so there's really no way an
application running on a terminal could directly print glyphs. Even if
it could, just "cat file_with_indic_text.txt" on the terminal, or
something simple like "ls", would probably not work as expected.

My hope is to work out a set of width assignments for characters so
that reasonable glyph presentations of the character sequence always
fit in the spacing provided by the sum of the "character widths".
Unfortunately this may result in excess spacing in some (many?) cases,
but I hope it can be made usable if not elegant. My (naive)
understanding is that Kannada conjuncts take place mostly as a
"subscript" to the bottom-right of the initial consonant and vowel
mark, so perhaps they'll look fairly proper in such a scheme.
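
The condition described above can be sketched as a simple check. Everything in the tables below is hypothetical: the per-character widths are not from any real wcwidth table, and the glyph widths are made-up numbers for an imagined font. The point is only the shape of the test: a cluster fits when the cells allotted by its characters are at least the cells its glyph needs, and any surplus is the "excess spacing".

```python
# Hypothetical per-character widths in cells (not from a real wcwidth table).
CHAR_WIDTH = {
    "\u0c95": 2,  # KANNADA LETTER KA
    "\u0ccd": 0,  # KANNADA SIGN VIRAMA
    "\u0cb7": 2,  # KANNADA LETTER SSA
}

# Hypothetical glyph widths in cells, as one imagined font might need them.
GLYPH_CELLS = {
    "\u0c95": 2,              # KA on its own
    "\u0c95\u0ccd\u0cb7": 3,  # conjunct KA + VIRAMA + SSA
}


def allotted_cells(cluster: str) -> int:
    """Cells reserved for a cluster: the sum of its character widths."""
    return sum(CHAR_WIDTH[ch] for ch in cluster)


def excess_cells(cluster: str) -> int:
    """Reserved cells minus needed cells; negative means the glyph overflows."""
    return allotted_cells(cluster) - GLYPH_CELLS[cluster]


def fits(cluster: str) -> bool:
    """True when the glyph fits in the space its characters reserve."""
    return excess_cells(cluster) >= 0


for cluster in GLYPH_CELLS:
    print(repr(cluster), fits(cluster), excess_cells(cluster))
```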

> I suppose, we need to develop console specific fonts which can make
> proper use of the available width classes (or the structure you
> propose), however, I don't think any research has occurred in this
> regard.

Well, as long as a reasonable font size were chosen, any font that
fits into the (possibly excessive) width allocation could be used in
principle. For uuterm I'm working on 8x16-cell (and later other larger
sizes) bitmap fonts, which I find much more usable, but there's no
reason other terminal emulators like mlterm couldn't use truetype
fonts in this framework.

> So, a proper answer to your question: how many width classes, really
> needs a lot of work both artistic as well as technical. (Malayalam
> has about 950 conjuncts, so it has to be seen how they can fit into
> those classes).

Well my question is much simpler I think: given a character, what's
the "most space" it can take up in any conjunct it forms?
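
One possible way to turn that question into a procedure, given as a sketch only: take a table of measured clusters, split each cluster's width evenly (rounding up) among its spacing characters, and keep the largest share each character ever receives. The measurements and the zero-width set below are hypothetical, and this even-split rule is just one sufficient choice, not necessarily the tightest one. It assumes every cluster contains at least one spacing character.

```python
import math

# Assumption: the virama reserves no cell of its own.
ZERO_WIDTH = {"\u0d4d"}  # MALAYALAM SIGN VIRAMA

# Hypothetical measurements: cluster -> cells an imagined font needs.
MEASURED = {
    "\u0d15": 1,              # KA alone
    "\u0d24": 1,              # TA alone
    "\u0d15\u0d4d\u0d15": 2,  # KA + VIRAMA + KA
    "\u0d15\u0d4d\u0d24": 3,  # KA + VIRAMA + TA
}


def worst_case_widths(measured: dict[str, int]) -> dict[str, int]:
    """Per-character widths large enough to cover every measured cluster."""
    widths: dict[str, int] = {}
    for cluster, cells in measured.items():
        spacing = [ch for ch in cluster if ch not in ZERO_WIDTH]
        # Each spacing character must carry at least an even share.
        share = math.ceil(cells / len(spacing))
        for ch in spacing:
            widths[ch] = max(widths.get(ch, 0), share)
    return widths


widths = worst_case_widths(MEASURED)
for cluster, cells in MEASURED.items():
    reserved = sum(widths.get(ch, 0) for ch in cluster)
    print(repr(cluster), "needs", cells, "reserved", reserved)
```

Because every spacing character gets at least its share of every cluster it appears in, the reserved total can never fall short of the measured width, at the cost of some wasted cells in narrower clusters.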

> Speaking of curses, doesn't Debian/(K)ubuntu use curses for its
> installer? I remember telling the Kubuntu devels to remove Hindi
> from the list of languages, because looking at the rendering is
> really horrible (misplaced vowels, and so many other things,
> unrelated to spacing/width).

Yes, but it's not really a curses problem. As long as the terminal
supports reordering and ligatures, using curses should not be much of
a problem. I still need to write the reordering stuff for uuterm
though.

> It is unfortunate that many developers think that using
> widestrings for each character is equivalent to support for all
> languages under Unicode. I guess some even think that the
> dotted-circle is a part of the script ;)

Haha yeah. I still can't believe Roman Czyborra drew the original GNU
Unifont with those hideous dotted circles in it... (Yes he knew they
weren't part of the script, but...) My hope is to make it so that
using multibyte char functions + wcwidth is sufficient for _usable_
support for all langs in apps that run on terminals. Then, as more
users of these langs use the apps in question, hopefully other things
(like line folding in scripts without word spacing, better spacing,
integration with input methods, etc.) will come. Unlike most of the
GUI projects working on these issues my goal is not to put
word-processor-type layout in every app, just to fix what's broken and
make them usable with more languages.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
