On Tue, Apr 10, 2007 at 12:36:28PM +0200, Egmont Koblinger wrote:
> Though I cannot answer your original question, I've just found recently that
> glibc's wcwidth database suffers from problems. There are a lot of letters
> or letter-like symbols that are unprintable according to glibc (wcwidth
> returns -1, iswprint returns 0). For example U+0221 (latin small letter d
> with curl) is the first such character. I think we should submit a bugreport
> for glibc...

Indeed, glibc's character data is badly outdated and incorrect.
Plenty of nonspacing characters are unsupported, even ones that
were already present in Unicode 4.0. It also considers nonspacing
letters to be non-alphabetic, which is a real problem for users of
languages that use nonspacing letters.

As for wcwidth and iswprint, I recently changed my libc implementation
to consider all Unicode codepoints except illegal/noncharacter/control
codepoints as printable, with a wcwidth of 1 for the BMP and plane 1,
and a wcwidth of 2 for planes 2 and 3. While this is still imperfect
(it won't account for newly added characters with width 0, for
example), it at least lets users with outdated libc/locale data use
the new characters they might need in a minimal sort of way. I would
recommend that the glibc maintainers do something similar.
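
To make the rule above concrete, here is a rough sketch in C. It is not
the actual code from my libc; the function names are made up for
illustration, and treating planes above 3 as width 1 and U+0000 as
width 0 are assumptions the description above does not spell out.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if the codepoint is treated as printable.
 * Everything is printable except illegal, noncharacter, and control
 * codepoints. */
static int sketch_isprint(uint32_t c)
{
    if (c > 0x10FFFF) return 0;                         /* outside Unicode */
    if (c < 0x20 || (c >= 0x7F && c < 0xA0)) return 0;  /* C0, DEL, C1 controls */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;           /* surrogates (illegal) */
    if (c >= 0xFDD0 && c <= 0xFDEF) return 0;           /* noncharacters */
    if ((c & 0xFFFE) == 0xFFFE) return 0;               /* U+xxFFFE and U+xxFFFF */
    return 1;
}

/* Hypothetical helper: fallback column width that depends only on the plane. */
static int sketch_wcwidth(uint32_t c)
{
    if (c == 0) return 0;                    /* assumption: NUL has width 0 */
    if (!sketch_isprint(c)) return -1;       /* unprintable */
    unsigned plane = c >> 16;
    if (plane == 2 || plane == 3) return 2;  /* planes 2 and 3: wide */
    return 1;  /* BMP and plane 1; assumption: other planes are also 1 */
}
```

With this, U+0221 comes out printable with width 1 instead of -1, and
anything in plane 2 gets width 2 no matter how old the locale data is.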

> I don't know whether the width info varies or should vary between different
> utf-8 locales.

The East Asian ambiguous-width characters are wide in CJK locales and
narrow in others. This is probably annoying for some CJK users, since
the characters (such as Greek and Cyrillic) obviously should be narrow
typographically; they're wide only for the sake of old programs and
ASCII-art-style content that were designed for legacy charsets. IMO
they should be made narrow by default in all locales, with a modifier
like "@wide" or something for the users who actually need them wide.
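
Just to illustrate the idea, a sketch of how a width lookup might pick
the width for ambiguous characters from the locale name. The "@wide"
modifier does not exist in any libc as far as I know, and the function
name is made up; the sketch only assumes the usual
language[_territory][.codeset][@modifier] form of locale names.

```c
#include <string.h>

/* Hypothetical helper: column width to use for East Asian ambiguous-width
 * characters, given a locale name such as "ja_JP.UTF-8@wide".
 * Narrow (1) by default; wide (2) only when the proposed "@wide"
 * modifier is present. */
static int ambiguous_width_for_locale(const char *locale_name)
{
    const char *mod = strchr(locale_name, '@');  /* start of modifier, if any */
    if (mod && strcmp(mod + 1, "wide") == 0)
        return 2;
    return 1;
}
```

So "ja_JP.UTF-8" would get narrow Greek and Cyrillic like every other
locale, and only "ja_JP.UTF-8@wide" would keep the legacy behavior.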

~Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
