Bruno Haible wrote on 2000-09-27 13:19 UTC:
> Ad 1) wcwidth() can also be used as an equivalent of iswprint(). This
> is legitimate according to the spec, and GNU ls already uses it this
> way (because calling iswprint and then wcwidth would be redundant).

Yes, this aspect causes me pain at the moment with regard to how
combining characters are treated. The last glibc version that I tested
(2.1.93+) had iswprint() = 0 for every combining character and, as a
directly hardwired consequence, also wcwidth() = -1 for every
combining character. That is definitely useless in the current form
unless you restrict yourself to UCS Level 1! First of all, normal
combining characters also cause ink (or electrons or lack thereof) to
appear on the output device, so they are definitely "printable" in any
sense of the word that I can find, although they are zero-width and
positioned relative to another glyph.

The obvious and clean solution:

  iswprint(COMBINING *) = 1

such that we can have

  wcwidth(COMBINING *) = 0

and then wcswidth() will become fully usable on decomposed Unicode
strings.
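
To make the intended effect concrete, here is a minimal sketch in C of the
proposed semantics. It is not glibc's implementation: the sketch_* names are
hypothetical, the combining test covers only the Combining Diacritical Marks
block (U+0300..U+036F), and double-width characters are ignored for brevity.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: recognises only U+0300..U+036F (Combining
   Diacritical Marks). A real locale would use the full Unicode data. */
static int sketch_is_combining(wchar_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Proposed behaviour: combining characters are printable and have
   width 0; control characters remain non-printable (width -1). */
static int sketch_wcwidth(wchar_t c)
{
    if (c == L'\0')
        return 0;
    if (sketch_is_combining(c))
        return 0;
    if (c < 0x20 || (c >= 0x7F && c < 0xA0))
        return -1;              /* C0 controls, DEL, C1 controls */
    return 1;                   /* simplification: no double-width chars */
}

/* Same contract as wcswidth(): sum the widths of at most n characters,
   or return -1 as soon as a non-printable character is seen. */
static int sketch_wcswidth(const wchar_t *s, size_t n)
{
    int total = 0;

    for (; n > 0 && *s != L'\0'; s++, n--) {
        int w = sketch_wcwidth(*s);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}
```

With these definitions, the decomposed sequence "e" followed by U+0301
(COMBINING ACUTE ACCENT) occupies one column. If the combining character
returned -1 instead, the whole string would get width -1.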

I would also favor implementing in glibc's locales the idea of treating the
"ZERO-WIDTH *" characters exactly like combining characters, that is
printable (they will after all appear in the middle of words and their
only control function is to influence ligatures, which all other
printable characters do as well!) and of width zero.

This way, the wcwidth() and iswprint() relationship remains as it is, no
software has to treat zero-width spaces as special cases and everything
else also runs nicely and smoothly.

If iswprint(c) = 0, and c is not one of the well-known control
characters (esc, lf, ht, etc.), then applications should probably not
send it to a terminal emulator (send some default character instead). It
is a matter of taste whether we should then have wcwidth() = 1 (because of
the default character that the terminal shows in its place) or wcwidth() =
-1 (because of the X/Open spec), and I can't argue strongly
against the latter.
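
The filtering rule above could look roughly like the following C sketch.
The function names are hypothetical, and the exact set of controls that are
passed through (ESC, LF, HT, CR here) is my assumption. The result of
iswprint() depends on the current locale for non-ASCII characters.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Controls that are passed through unchanged. The exact set is an
   assumption made for this sketch. */
static int is_known_control(wchar_t c)
{
    return c == L'\033' || c == L'\n' || c == L'\t' || c == L'\r';
}

/* Replace, in place, every character that is neither printable nor a
   known control with the given default character. Returns the number
   of characters replaced. */
static size_t sanitize_for_terminal(wchar_t *s, wchar_t replacement)
{
    size_t replaced = 0;

    for (; *s != L'\0'; s++) {
        if (!iswprint((wint_t)*s) && !is_known_control(*s)) {
            *s = replacement;
            replaced++;
        }
    }
    return replaced;
}
```

An application would call sanitize_for_terminal(buf, L'?') before writing
buf to the terminal. Since each replaced character then occupies one
column, this is the case where wcwidth() = 1 would match what ends up on
the screen.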

Another thing about the glibc locales that I am somewhat sceptical about
is that unassigned characters currently have iswprint() = 0. The BMP
will continue to be extended for many years to come, and almost all
characters that will go into currently unassigned slots will be
printable. The locales would remain considerably more stable as Unicode
evolves if iswprint(c) = 1 for all characters that are unassigned in the
respective latest Unicode DB.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
