I looked a bit at the Tcl/Tk 8.3.2 code from www.scriptics.com and
found some odd things that should probably be checked, one of which is
most likely the reason for the garbled iso10646-1 output.

1) Potential immediate problem: there might be a big-endian/little-endian
bug. tclUtf.c:Tcl_UtfToUniChar returns in *chPtr a Tcl_UniChar, which is
UCS-2 in the byte order of the CPU. Tk_DrawChars gets its string from
Tcl_UtfToExternalDString, and therefore from UtfToUnicodeProc and
Tcl_UtfToUniChar, in that same native byte order, but XDrawString16 will
interpret it as a sequence of big-endian 16-bit values (that is,
UCS-2BE for an iso10646-1 font), because Xlib defines its element type as:

    typedef struct {      /* normal 16 bit characters are two bytes */
        unsigned char byte1;  // row = high byte
        unsigned char byte2;  // column = low byte
    } XChar2b;

Suggestion: Do *not* use the encoding "unicode". Instead, add a new
encoding called "UCS-2BE", which is the same but always big-endian, so
that its output can be cast to an array of XChar2b without risk. See the
ISO 10646-1 standard for the official definition of "UCS-2".
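
As a rough illustration of why the byte order has to be fixed explicitly,
here is a minimal sketch that fills byte pairs from 16-bit values by
shifting rather than by casting. Char2b is a stand-in for XChar2b so the
sketch compiles without the X11 headers, and ucs2_to_char2b is a
hypothetical helper name, not anything in the Tcl/Tk source.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Xlib's XChar2b (same two fields, same order). */
typedef struct {
    unsigned char byte1;  /* row = high byte */
    unsigned char byte2;  /* column = low byte */
} Char2b;

/*
 * Hypothetical helper: copy n native-endian 16-bit code units into
 * big-endian byte pairs. Shifting works the same on every CPU,
 * whereas casting the source array would only work on big-endian ones.
 */
static void ucs2_to_char2b(const uint16_t *src, size_t n, Char2b *dst)
{
    size_t i;

    for (i = 0; i < n; i++) {
        dst[i].byte1 = (unsigned char)(src[i] >> 8);
        dst[i].byte2 = (unsigned char)(src[i] & 0xFF);
    }
}
```

An encoding that already produces UCS-2BE would make this copy
unnecessary, which is the point of the suggestion above.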

By the way: ISO 10646-1:2000 can be ordered on CD-ROM for 80 CHF (~45 US$)
from http://www.iso.ch/cate/d29819.html in case you don't have a
copy yet on your desk.

I strongly recommend that the encoding name "unicode" be withdrawn.
Perhaps you want to make it an alias for "UTF-16", but certainly don't
use it in your code. Unicode is the name of a generic standard that can
be encoded in many ways. It is not the unambiguous name of a specific
byte-sequence encoding. Specific byte-sequence encodings are called
"UTF-8", "UTF-16", "UTF-32", "UCS-2", "UCS-4", "UTF-16BE", "UTF-32BE",
"UCS-2BE", "UCS-4BE", "UTF-16LE", "UTF-32LE", "UCS-2LE", "UCS-4LE",
depending on which subset of Unicode they cover, how they encode it, and
whether the endianness is system-specific, big-endian or little-endian.
If you are unfamiliar with any of these, then please do read ISO 10646-1
as well as

  http://www-106.ibm.com/developerworks/library/utfencodingforms/
  http://www.unicode.org/unicode/reports/tr19/tr19-7.html

2) tkUnixFont.c:FontMapLoadPage says

        if ((hi < minHi) || (hi > maxHi) || (lo < checkLo) || (lo > maxLo)) {
            continue;
        }

which I suspect works only for 2-dimensional fonts, but not for linear
fonts (where maxLo > 255). But since the font you tested is 2-D and
not linear, this is not the immediate problem.
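
A range test that copes with both layouts might look roughly like the
sketch below. It assumes the Xlib convention that a font whose
min_byte1 and max_byte1 are both zero is indexed linearly by
min_char_or_byte2..max_char_or_byte2. The function name in_font_range
is hypothetical, and the parameter names merely echo the variables in
the excerpt above.

```c
/*
 * Hypothetical helper: is the 16-bit character code ch inside the
 * range that the font's encoding can address?
 */
static int in_font_range(unsigned ch, unsigned minHi, unsigned maxHi,
                         unsigned minLo, unsigned maxLo)
{
    unsigned hi, lo;

    if (minHi == 0 && maxHi == 0) {
        /* Linear font: compare the whole index, which may exceed 255. */
        return ch >= minLo && ch <= maxLo;
    }
    /* 2-D font: compare row and column separately. */
    hi = (ch >> 8) & 0xFF;
    lo = ch & 0xFF;
    return hi >= minHi && hi <= maxHi && lo >= minLo && lo <= maxLo;
}
```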

3) Also, I get the impression that FontMapLoadPage checks only whether
the character to be displayed is in principle encodable using the
encoding of the font, but not whether the glyph is actually present in
the font. It would probably be better to check via the per_char array in
the XFontStruct of the font, whether the glyph is actually present.
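
A per-glyph check along these lines could be sketched as follows. The
two structs are stand-ins that mirror only the relevant fields of
Xlib's XCharStruct and XFontStruct, so the sketch compiles without the
X11 headers, and glyph_present is a hypothetical name. It relies on the
documented Xlib conventions that a nonexistent character has all of its
metrics set to zero, and that a NULL per_char means every character in
range exists with identical metrics.

```c
#include <stddef.h>

/* Stand-in for the metric fields of Xlib's XCharStruct. */
typedef struct {
    short lbearing, rbearing, width, ascent, descent;
} CharMetrics;

/* Stand-in for the relevant fields of Xlib's XFontStruct. */
typedef struct {
    unsigned min_char_or_byte2, max_char_or_byte2;
    unsigned min_byte1, max_byte1;
    const CharMetrics *per_char;  /* NULL: all glyphs in range exist */
} FontInfo;

/* Hypothetical helper: does the font really contain a glyph for ch? */
static int glyph_present(const FontInfo *f, unsigned ch)
{
    unsigned cols = f->max_char_or_byte2 - f->min_char_or_byte2 + 1;
    unsigned hi, lo;
    size_t idx;
    const CharMetrics *m;

    if (f->min_byte1 == 0 && f->max_byte1 == 0) {
        /* Linear font: one index running from min to max. */
        if (ch < f->min_char_or_byte2 || ch > f->max_char_or_byte2)
            return 0;
        idx = ch - f->min_char_or_byte2;
    } else {
        /* 2-D font: row-major table of rows (byte1) by columns (byte2). */
        hi = (ch >> 8) & 0xFF;
        lo = ch & 0xFF;
        if (hi < f->min_byte1 || hi > f->max_byte1 ||
            lo < f->min_char_or_byte2 || lo > f->max_char_or_byte2)
            return 0;
        idx = (size_t)(hi - f->min_byte1) * cols
            + (lo - f->min_char_or_byte2);
    }
    if (f->per_char == NULL)
        return 1;
    m = &f->per_char[idx];
    /* All-zero metrics mark a character that is not in the font. */
    return m->width || m->lbearing || m->rbearing ||
           m->ascent || m->descent;
}
```

A check of this kind would let a missing glyph in one font trigger a
fallback to another font even when the encoding could address it.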

Unlike most 8-bit or JIS fonts, iso10646-1 fonts are in practice never
complete. The standard grows continually, and not all scripts of the
planet can be accommodated in a single font with consistent style.
Iso10646-1 fonts are in practice however always more complete than fonts
of the same style in other encodings on the same system, so they should
be tried first.

At least the above hi/lo check understands that European iso10646-1
fonts usually do not contain anything above 0x31FF and therefore do not
cover Japanese and therefore a JIS font still has to be loaded as a
fallback. But for example, the routine as presently used does *not*
understand that most of the "bold" iso10646-1 fonts lack the
mathematical symbols and therefore it won't cause a fallback to "medium"
if a character is missing in the "bold" font.

4) I also noted that tclUtf.c:Tcl_UtfToUniChar accepts overlong UTF-8
sequences. This can be a security vulnerability and is forbidden in
Unicode 3.1. Practical example: a secure UTF-8 decoder must NOT accept any of

  0xc0 0x8A
  0xe0 0x80 0x8A
  0xf0 0x80 0x80 0x8A
  0xf8 0x80 0x80 0x80 0x8A
  0xfc 0x80 0x80 0x80 0x80 0x8A

as a valid encoding for U+000A, otherwise this could be used by
attackers to bypass ASCII-level integrity checks (e.g. string must be a
single line because it contains no 0x0a) before the UTF-8 decoder.
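
For illustration, a decoder that rejects such sequences can be sketched
as below. This is not the Tcl implementation. utf8_decode_strict is a
hypothetical function, and as an extra assumption it follows the current
definition of UTF-8 by refusing all 5- and 6-byte forms and anything
above U+10FFFF. The key step is comparing the decoded value against the
smallest value that legitimately needs that many bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical strict decoder: returns the number of bytes consumed and
 * stores the code point in *out, or returns 0 if the sequence is
 * malformed, truncated or overlong.
 */
static size_t utf8_decode_strict(const unsigned char *s, size_t len,
                                 uint32_t *out)
{
    uint32_t cp, min;
    size_t need, i;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;  /* stray continuation byte or 0xF8..0xFF lead byte */
    }
    if (len < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min)
        return 0;  /* overlong: a shorter form exists */
    if (cp > 0x10FFFF)
        return 0;  /* beyond the Unicode code space */
    *out = cp;
    return need;
}
```

With this check, 0x0a is the only byte sequence that decodes to U+000A,
so an ASCII-level test done before decoding remains valid afterwards.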

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
http://www.unicode.org/unicode/reports/tr27/    (search for "UTF-8 Corrigendum")

I couldn't find the Tcl bug tracking system any more, so I hope reporting
this to you is the appropriate thing to do.

Best regards,

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/