On Fri, 4 July 2003 04:25, Wu Yongwei wrote:
> I wonder, how many people really want to use Unicode codepoints beyond
> U+FFFF?
I don't want to make it incorrect by design just because the cases it
doesn't handle are rare.
> However, UTF-16 is harder to process than UTF-8 (I think you are wrong
> on this point). For UTF-8 you will not split in the middle of a
> sequence by mistake (encoding guarantees it), but for UTF-16 it is much
> more likely to happen.
Why? In both cases it can happen unless the library does extra checks to
prevent it, and the checks are equally easy for both encodings. I'm not
saying that I would make these checks - it's probably easier to deal with
the possibility of broken strings.
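
To illustrate why the two checks are comparable, here is a minimal sketch in
Python (not the language discussed in this thread). The function names are
hypothetical. It assumes the input is otherwise well-formed, so it only
answers "would a split at this offset cut a sequence in half?".

```python
from typing import List


def is_utf8_boundary(data: bytes, i: int) -> bool:
    """True if splitting before byte offset i does not cut a UTF-8 sequence."""
    if i <= 0 or i >= len(data):
        return True
    # Continuation bytes have the bit pattern 10xxxxxx.
    return (data[i] & 0xC0) != 0x80


def is_utf16_boundary(units: List[int], i: int) -> bool:
    """True if splitting before code unit i does not cut a surrogate pair."""
    if i <= 0 or i >= len(units):
        return True
    # The split is bad only when a high surrogate is followed by a low one.
    high = 0xD800 <= units[i - 1] <= 0xDBFF
    low = 0xDC00 <= units[i] <= 0xDFFF
    return not (high and low)


def utf16_units(text: str) -> List[int]:
    """Helper for the example: the UTF-16 code units of a Python string."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[k:k + 2], "little") for k in range(0, len(raw), 2)]
```

Each check is a constant-time test on one or two code units; neither
encoding needs a scan of the whole string.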
> Similarly, it is safe to search for a valid UTF-8 character in a UTF-8
> sequence using C functions like strstr. This is a great advantage.
Not for me. This is my language, not C. Strings are objects, usually
heap-allocated, with an object header of one machine word and an explicit
length. C is used only as a portable assembler and glue to libraries.
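
A small sketch of what searching looks like when strings carry an explicit
length. Python's bytes type stands in for such a string here; that is an
assumption for illustration, not the representation described above, and
find_utf8 is a hypothetical name. A byte-wise search for a valid UTF-8
needle still lands on a character boundary, and a zero byte in the haystack
does not end the search.

```python
def find_utf8(haystack: bytes, needle: bytes) -> int:
    """Byte-wise search; returns the byte offset of the first match, or -1."""
    return haystack.find(needle)


# The haystack contains an embedded zero byte before the match.
haystack = "na\u00efve\x00 caf\u00e9".encode("utf-8")
needle = "\u00e9".encode("utf-8")

pos = find_utf8(haystack, needle)

# The match starts on a lead byte, never on a 10xxxxxx continuation byte.
starts_on_boundary = (haystack[pos] & 0xC0) != 0x80
```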
> '\0' is not a valid character; it is a byte. Don't confuse them.
It can occur in a file. The library can't just silently truncate a line when
it encounters one - again, although it's rare, that would be broken by
design, so I won't do that. I won't be mistake-compatible with C.
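
The difference can be sketched as follows, again in Python and with
hypothetical helper names. read_lines keeps each line as a length-carrying
byte string, while c_style_view shows what a consumer that treats the same
data as NUL-terminated would see.

```python
import io


def read_lines(stream):
    """Yield lines as length-carrying byte strings; embedded zero bytes survive."""
    for line in stream:
        yield line.rstrip(b"\n")


def c_style_view(line: bytes) -> bytes:
    """What a NUL-terminated consumer sees: the bytes before the first zero byte."""
    return line.split(b"\x00", 1)[0]


# A file whose first line contains a zero byte.
stream = io.BytesIO(b"ab\x00cd\nsecond\n")
lines = list(read_lines(stream))
truncated = c_style_view(lines[0])
```

The first line keeps all five bytes in lines[0], while truncated holds only
the two bytes before the zero byte.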
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/