On Tue, Apr 10, 2007 at 12:36:28PM +0200, Egmont Koblinger wrote:
> Though I cannot answer your original question, I've just found recently that
> glibc's wcwidth database suffers from problems. There are a lot of letters
> or letter-like symbols that are unprintable according to glibc (wcwidth
> returns -1, iswprint returns 0). For example U+0221 (latin small letter d
> with curl) is the first such character. I think we should submit a bugreport
> for glibc...

Indeed, glibc's character data is horribly outdated and incorrect.
There are plenty of unsupported nonspacing characters, even characters
that were present in Unicode 4.0. It also considers nonspacing letters
to be non-alphabetic, which is a real problem for users of languages
which utilize nonspacing letters.

As for wcwidth and iswprint, I recently changed my libc implementation
to consider all Unicode codepoints except illegal/noncharacter/control
codepoints as printable, with a wcwidth of 1 for the BMP and plane 1,
and a wcwidth of 2 for planes 2 and 3. While this is still imperfect
(it won't account for added characters with width 0, for example), it
at least makes it so users with outdated libc/locale data can use the
new characters they might need in a minimal sort of way. I would
recommend that the glibc maintainers do something similar.

> I don't know whether the width info varies or should vary between different
> utf-8 locales.

The ambiguous characters are wide in CJK locales and narrow in others.
This is probably annoying for some CJK users since the characters
(such as Greek and Cyrillic) obviously should be narrow
typographically; they're wide only for the sake of old programs and
ascii-art type stuff which were designed for legacy charsets. IMO they
should be made narrow by default in all locales with a modifier like
"@wide" or something for the users who actually need them wide.

~Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
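
For illustration, the fallback policy described above (everything except
illegal, noncharacter, and control codepoints is printable; width 1 for the
BMP and plane 1, width 2 for planes 2 and 3) might be sketched in C roughly
as follows. This is only a sketch, not the actual libc code: the names
fallback_iswprint, fallback_wcwidth, and is_excluded are hypothetical, it
models only the default for characters that a real width table does not
cover, and treating planes 4 through 16 as width 1 is an assumption, since
the message does not say what happens there.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if cp is an illegal, noncharacter, or
 * control codepoint, i.e. excluded from the "printable by default" set. */
static int is_excluded(uint32_t cp)
{
    if (cp > 0x10FFFF) return 1;                 /* outside the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 1;  /* UTF-16 surrogates */
    if (cp < 0x20) return 1;                     /* C0 controls */
    if (cp >= 0x7F && cp <= 0x9F) return 1;      /* DEL and C1 controls */
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return 1;  /* BMP noncharacter block */
    if ((cp & 0xFFFE) == 0xFFFE) return 1;       /* U+xxFFFE and U+xxFFFF */
    return 0;
}

/* Fallback printability: anything not excluded counts as printable. */
int fallback_iswprint(uint32_t cp)
{
    return !is_excluded(cp);
}

/* Fallback column width for characters missing from the width data. */
int fallback_wcwidth(uint32_t cp)
{
    if (cp == 0) return 0;            /* POSIX: the null wide character is 0 */
    if (is_excluded(cp)) return -1;   /* not printable */
    if (cp >= 0x20000 && cp <= 0x3FFFF)
        return 2;                     /* planes 2 and 3: assumed wide */
    return 1;                         /* BMP, plane 1; planes 4+ by assumption */
}
```

With this fallback, U+0221 from the quoted message comes out as printable
with width 1 instead of -1. As noted above, new zero-width (nonspacing)
characters would still be reported as width 1.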
