Proposed fix for Malayalam (& other Indic?) chars and wcwidth

Rich Felker Fri, 13 Oct 2006 21:20:32 -0700

While working on uuterm[1], I've run into a problem with wcwidth(3)
behavior for the characters U+0D4A-U+0D4C, and possibly others like
them. These characters are combining marks that attach on both sides
of a cluster and are canonically equivalent to the two separate pieces
from which they are built, yet Markus' wcwidth implementation and GNU
libc assign them a width of 1. It seems obvious to me that there's no
hope of rendering both of these parts in a single character cell on a
character cell device, and even if it were possible, it also seems
horribly wrong for canonically equivalent strings to have different
widths.
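
To make the canonical equivalence concrete, here is a minimal sketch
that lists the NFD decomposition of each of the three characters. It
uses Python's unicodedata module purely for illustration; the helper
name nfd_codepoints is made up for this example and is not part of
uuterm or of any wcwidth implementation.

```python
import unicodedata

# The three Malayalam two-part vowel signs discussed above.
TWO_PART_VOWELS = ["\u0d4a", "\u0d4b", "\u0d4c"]


def nfd_codepoints(ch):
    """Return the NFD decomposition of ch as a list of 'U+XXXX' strings.

    Hypothetical helper, named here only for illustration.
    """
    return ["U+%04X" % ord(c) for c in unicodedata.normalize("NFD", ch)]


if __name__ == "__main__":
    for ch in TWO_PART_VOWELS:
        print("U+%04X -> %s" % (ord(ch), " ".join(nfd_codepoints(ch))))
```

Each of the three decomposes into two code points, one for the piece
on each side of the cluster.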


I propose amending the wcwidth definitions to assign these characters
(and any like them) a width of 2. Furthermore, I would suggest that
any characters with canonical decompositions be assigned a width that
is the sum of the widths of the decomposition into NFD. This would
avoid similar unfortunate situations in the future that might not yet
have been found. It may also be desirable to do this for compatibility
decompositions (like "dz", etc.); however, I suspect it's unlikely
that anyone would use such characters in non-legacy data anyway.
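
The sum-over-NFD rule could be sketched roughly as follows. This is
only an illustration of the idea, not a patch against any real
wcwidth: toy_width is a hypothetical stand-in for the per-character
width table (zero for nonspacing marks, enclosing marks and format
characters, one for everything else, with no handling of East Asian
wide characters or control characters), and nfd_sum_width is likewise
a name invented for this example.

```python
import unicodedata


def toy_width(ch):
    """Hypothetical stand-in for a per-character width lookup.

    Assumption: width 0 for general categories Mn, Me and Cf, width 1
    otherwise. East Asian wide and control characters are ignored.
    """
    return 0 if unicodedata.category(ch) in ("Mn", "Me", "Cf") else 1


def nfd_sum_width(ch, base_width=toy_width):
    """Width of ch as the sum of base_width over its NFD decomposition."""
    return sum(base_width(c) for c in unicodedata.normalize("NFD", ch))


if __name__ == "__main__":
    for ch in ("\u0d4a", "\u0d4b", "\u0d4c", "\u00e9", "a"):
        print("U+%04X: per-character %d, NFD sum %d"
              % (ord(ch), toy_width(ch), nfd_sum_width(ch)))
```

Under these assumptions the three Malayalam vowel signs come out with
width 2, while a precomposed letter such as U+00E9 keeps width 1
because its combining accent contributes zero.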

BTW, I don't think there's any harm here in breaking compatibility
with existing practice, since obviously no one is using the results of
wcwidth on these characters, or they would already have run into this
problem.

Rich


[1] http://svn.mplayerhq.hu/uuterm/


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
