Working on uuterm[1], I've run into a problem with the wcwidth(3) behavior of the characters U+0D4A through U+0D4C, and possibly others like them. These characters are combining marks that attach on both sides of a cluster, and each is canonically equivalent to the sequence of two separate pieces from which it is built, yet both Markus' wcwidth implementation and GNU libc assign them a width of 1. It seems obvious to me that there's no hope of rendering both parts in a single cell on a character cell device, and even if it were possible, it would still be horribly wrong for canonically equivalent strings to have different widths.
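
To make the mismatch concrete, here is a rough Python sketch. Note that cell_width() is only a toy stand-in I made up for illustration: it imitates the general shape of the wcwidth rules (nonspacing marks and format characters get 0, East Asian wide characters get 2, everything else gets 1) using Python's unicodedata module, and it is not the actual table from Markus' implementation or from glibc. It ignores control characters entirely. string_width() is likewise just a hypothetical helper name.

```python
import unicodedata

def cell_width(ch):
    """Toy stand-in for wcwidth(3); an assumption, NOT the real tables."""
    cp = ord(ch)
    # Nonspacing marks, enclosing marks and format characters take no cell.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Conjoining Hangul jamo vowels and finals take no cell of their own.
    if 0x1160 <= cp <= 0x11FF:
        return 0
    # East Asian wide and fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def string_width(s):
    """Sum of the per-character widths, as wcswidth(3) would compute it."""
    return sum(cell_width(c) for c in s)

for cp in (0x0D4A, 0x0D4B, 0x0D4C):
    composed = chr(cp)
    # NFD replaces the character with its canonical decomposition.
    decomposed = unicodedata.normalize("NFD", composed)
    parts = " ".join("U+%04X" % ord(c) for c in decomposed)
    print("U+%04X = %s: width %d precomposed, %d decomposed"
          % (cp, parts, string_width(composed), string_width(decomposed)))
```

Under those assumptions, the precomposed and decomposed forms of each character should come out with different widths, even though they are canonically equivalent and must render identically.
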
I propose amending the wcwidth definitions to assign these characters (and any like them) a width of 2. Furthermore, I would suggest that any character with a canonical decomposition be assigned a width equal to the sum of the widths of the characters in its NFD decomposition. This would avoid similar unfortunate situations that may exist but have not yet been found. It may also be desirable to do the same for compatibility decompositions (like "dz", etc.); however, I suspect it's unlikely that anyone would use such characters in non-legacy data anyway.

BTW, I don't think there's any harm in breaking compatibility with existing practice here, since evidently no one is using the results of wcwidth on these characters, or they would already have run into this problem.

Rich

[1] http://svn.mplayerhq.hu/uuterm/

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
