On Fri, 4 July 2003 04:25, Wu Yongwei wrote:

> I wonder, how many people really want to use Unicode codepoints beyond
> U+FFFF?

I don't want to make it incorrect by design just because the cases it
doesn't handle are rare.

> However, UTF-16 is harder to process than UTF-8 (I think you are wrong
> on this point).  For UTF-8 you will not split in the middle of a
> sequence by mistake (encoding guarantees it), but for UTF-16 it is much
> more likely to happen.

Why? In both encodings it can happen unless the library does extra checks to
prevent it, and those checks are equally easy. I'm not saying that I would make
these checks; it's probably easier to deal with the possibility of broken strings.
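
To illustrate why the two checks are comparable, here is a minimal sketch in
Python (not the language discussed in this thread). The function names are
hypothetical, and both checks assume the input is otherwise well-formed. Each
one is a single mask-and-compare on the code unit at the proposed split point.

```python
import struct
from typing import Sequence


def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as a list of integers (helper for the demo)."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_utf8_boundary(data: bytes, i: int) -> bool:
    """True if splitting data at index i does not cut a UTF-8 sequence.

    Assumes data is well-formed UTF-8. Continuation bytes look like 10xxxxxx,
    so a split is unsafe only when the byte at i is a continuation byte.
    """
    if i <= 0 or i >= len(data):
        return True
    return (data[i] & 0xC0) != 0x80


def is_utf16_boundary(units: Sequence[int], i: int) -> bool:
    """True if splitting units at index i does not cut a surrogate pair.

    Assumes units is well-formed UTF-16. Low (trailing) surrogates lie in
    0xDC00..0xDFFF, so a split is unsafe only when the unit at i is one.
    """
    if i <= 0 or i >= len(units):
        return True
    return (units[i] & 0xFC00) != 0xDC00
```

Neither check comes for free from the encoding: a library that slices at
arbitrary offsets has to perform one or the other, or accept that broken
strings can exist.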

> Similarly, it is safe to search for a valid UTF-8 character in a UTF-8
> sequence using C functions like strstr.  This is a great advantage.

Not for me. This is my own language, not C. Strings are usually heap-allocated
objects with a one-machine-word object header and an explicit length.
C is used only as a portable assembler and as glue to libraries.

> '\0' is not a valid character; it is a byte.  Don't confuse them.

It can occur in a file. The library can't just silently truncate a line when
it encounters one. Again, although it's rare, that would be broken by design,
so I won't do it. I won't be mistake-compatible with C.
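
A small sketch of the difference, written in Python for illustration only.
`read_lines` and `c_style_view` are hypothetical names; the second one merely
simulates what a consumer of NUL-terminated strings would see. With an explicit
length, a NUL byte in a line is ordinary data and nothing is lost.

```python
import io


def read_lines(stream):
    """Yield each line as a bytes object, without its trailing newline.

    A bytes object carries its own length, so an embedded NUL is kept.
    """
    for line in stream:
        yield line.rstrip(b"\n")


def c_style_view(line: bytes) -> bytes:
    """Simulate a NUL-terminated reading of line: everything after the
    first NUL byte is silently dropped."""
    end = line.find(b"\x00")
    return line if end < 0 else line[:end]


# A file whose first line contains a NUL byte.
sample = io.BytesIO(b"abc\x00def\nsecond\n")
lines = list(read_lines(sample))

full = lines[0]                 # all 7 bytes of the first line survive
truncated = c_style_view(full)  # only the 3 bytes before the NUL survive
```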

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
