On Sun, Aug 06, 2006 at 07:34:16AM -0400, Chris Heath wrote:
>
> > To my knowledge there is still no official standard as to which
> > characters have which width, but POSIX specifies the function used to
> > obtain the width of each character (and defines the results as
> > 'locale-specific'), and Markus Kuhn's implementation is the de facto
> > standard and is based on applying very reasonable rules to the
> > published Unicode data (East Asian Width tables and Mn and Cf classes,
> > mainly).
>
> I think you are making some incorrect assumptions about wcwidth.
Not entirely, but yes, some.
> Firstly, it *is* locale-dependent. On my Fedora Core 4 system, I used
> this simple C program to test:
>
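> #define _XOPEN_SOURCE 700  /* exposes wcwidth() in <wchar.h> */
> #include <stdio.h>
> #include <locale.h>
> #include <wchar.h>
> int main(int argc, char** argv) {
>     unsigned int i;
>     /* usage: prog LOCALE HEX_CODEPOINT */
>     if (argc < 3 || sscanf(argv[2], "%x", &i) != 1)
>         return 1;
>     if (setlocale(LC_ALL, argv[1]))
>         printf("wcwidth(0x%04X)=%d in locale %s\n", i, wcwidth((wchar_t)i), argv[1]);
>     else
>         printf("Locale '%s' not found.\n", argv[1]);
>     return 0;
> }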
>
> And I got this output:
>
> wcwidth(0x00C0)=2 in locale ja_JP.eucJP
> wcwidth(0x00C0)=1 in locale ja_JP.UTF8
This is nothing but glibc being idiotic. Yes, it's _allowed_ to do this
according to POSIX (POSIX makes no requirements about correspondence
of the values returned to any other standard), but it's obviously
incorrect for the width of À to be anything but 1, even if it was
historically displayed wide (wtf?!) on some legacy CJK terminal types.

In practice, the only way wcwidth's results should be "locale
dependent" is when __STDC_ISO_10646__ is not defined, i.e. when the
implementation does not use UCS codepoints for wchar_t in non-UTF-8
locales. Some implementations (including [old?] BSD) use a one-to-one
mapping of char to wchar_t for legacy 8-bit locales and other simple
mappings for legacy CJK locales.

Keep in mind that as long as your wchar_t values come from the mb*towc
functions, the two locale dependencies (the one in the conversion and
the one in wcwidth) cancel out, and in practice the widths returned
will be "locale independent" in this case.
> Secondly, wcwidth doesn't appear to be derived from the East Asian width
> tables any more. UAX #11 lists U+00C0 as neutral, but the above example
> demonstrates that it is treated as ambiguous.
Again this is glibc being idiotic. File a bug report. :)
> Thirdly, I'm not sure how you plan to handle Hangul Jamo. Because
> wcwidth works on the level of Unicode characters, not glyphs, I can't
> see how you can handle the more general cases described in Section 3.12
> of the Unicode standard.
I'm aware that there's an issue with Hangul Jamo, but uncertain how
severe it is and what all the implications are.
> On my machine, wcwidth returns 2 for the leading Jamo consonants (L),
> and zero for the vowels (V) and trailing consonants (T). So if you have
> two leading consonants in a row, the second one should overstrike the
> first, but also has an extra width of 2 associated with it.
Why should two leading consonants in a row overstrike one another? Is
this actually used in the script? I seriously doubt that overstriking
is the correct behavior there but I don't know the script.
> I tried wcswidth to see if it returned 2 or 4 when you pass it a string
> with two leading consonants. It returned 4, which might be "incorrect"
> to a Korean eye, but at least it is consistent with wcwidth. However,
> the Single Unix Specification doesn't mandate that wcswidth return the
> sum of wcwidth for each character in the string.
Interesting. I hadn't realized that wcswidth and wcwidth were allowed
to disagree.
> So maybe your font
> system should base its widths on wcswidth instead of wcwidth, in case
> wcswidth is changed to handle this more general case in the future.
My font system has nothing to do with wcwidth or wcswidth. Column
width is an issue for the terminal emulator or other program displaying
text, not the font.
Rich
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/