> To my knowledge there is still no official standard as to which
> characters have which width, but POSIX specifies the function used to
> obtain the width of each character (and defines the results as
> 'locale-specific'), and Markus Kuhn's implementation is the de facto
> standard and is based on applying very reasonable rules to the
> published Unicode data (East Asian Width tables and Mn and Cf classes,
> mainly).
I think you are making some incorrect assumptions about wcwidth.
Firstly, it *is* locale-dependent. On my Fedora Core 4 system, I used
this simple C program to test:
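#define _XOPEN_SOURCE 700   /* expose wcwidth() in <wchar.h> */
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

/* Usage: ./a.out <locale> <hex code point> */
int main(int argc, char** argv) {
    unsigned int i;
    if (argc != 3 || sscanf(argv[2], "%x", &i) != 1) {
        fprintf(stderr, "usage: %s <locale> <hex code point>\n", argv[0]);
        return 1;
    }
    if (setlocale(LC_ALL, argv[1]))
        printf("wcwidth(0x%04X)=%d in locale %s\n", i, wcwidth((wchar_t)i), argv[1]);
    else
        printf("Locale '%s' not found.\n", argv[1]);
    return 0;
}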
And I got this output:
wcwidth(0x00C0)=2 in locale ja_JP.eucJP
wcwidth(0x00C0)=1 in locale ja_JP.UTF8
Secondly, wcwidth doesn't appear to be derived from the East Asian width
tables any more. UAX #11 lists U+00C0 as neutral (N), but the output
above shows it getting width 2 in the ja_JP.eucJP locale, which is how
an ambiguous (A) character would be treated.
Thirdly, I'm not sure how you plan to handle Hangul Jamo. Because
wcwidth works on the level of Unicode characters, not glyphs, I can't
see how you can handle the more general cases described in Section 3.12
of the Unicode standard.
On my machine, wcwidth returns 2 for the leading Jamo consonants (L),
and zero for the vowels (V) and trailing consonants (T). So if you have
two leading consonants in a row, the second one should overstrike the
first, yet wcwidth still reports an extra width of 2 for it.
I tried wcswidth to see if it returned 2 or 4 when you pass it a string
with two leading consonants. It returned 4, which might be "incorrect"
to a Korean eye, but at least it is consistent with wcwidth. However,
the Single Unix Specification doesn't mandate that wcswidth return the
sum of wcwidth for each character in the string. So maybe your font
system should base its widths on wcswidth instead of wcwidth, in case
wcswidth is changed to handle this more general case in the future.
Chris
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/