At 10:00 AM 05/12/2001 +0100, Markus Kuhn wrote:
 > We have now in wchar_t a nice infrastructure for handling 31-bit
 > characters, and I do urge all implementors of UTF-8 encoders and
 > decoders to keep them fully 31-bit transparent. UTF-16 is pretty
 > irrelevant to the GNU/POSIX platform. The wc API was not designed to
 > handle double-double-byte characters such as surrogate pairs. Why should
 > Linux programmers destroy the potentially useful full 31-bit space, just
 > because of silly interoperability concerns by the UTF-16 crowd? They are
 > just applying flawed logic IMHO: private-use characters are by
 > definition non-interoperable anyway, regardless of whether they can be
 > represented in Word doc files or not.

There's a difference between a valid UTF-8 file that can be read on any system
that handles UTF-8 and lets you choose your own fonts, and an invalid UTF-8
file that any conforming UTF-8 interpreter will just choke on. Handling only
valid UTF-8 offers many advantages: you're guaranteed to be able to convert it
to UTF-16 and SCSU, and you know you have an invalid file as soon as you see a
character outside 0 .. 10FFFF, which is very useful for internal and external
uses of UTF-32. Supporting the extended range, on the other hand, just adds
another rare case to write, test and debug, all for minimal gain. For those
reasons, Ngeadal will continue to handle only valid UTF-8. (As if anyone
actually uses Ngeadal, but there is hope for the future . . . )
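
Just to make the UTF-16 point concrete, here's a quick sketch of my own (the
function name is mine, it's not anything from Ngeadal or libc):

#include <stdint.h>

/* Encode one code point as UTF-16.  Returns the number of 16-bit units
 * written (1 or 2), or 0 if the value has no UTF-16 representation.
 * Illustrative sketch only. */
static int cp_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                  /* surrogate values aren't characters */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                 /* 20 bits, split across a surrogate pair */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        return 2;
    }
    return 0;                          /* above 10FFFF: nothing to map it to */
}

Anything above 10FFFF simply has no surrogate pair, so restricting yourself to
valid UTF-8 is exactly what makes the conversion total.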

If someone wants to make libc use this extended UTF-8, then that's
their business. Anything external that claims to be UTF-8 should be valid
UTF-8, so that any program that handles UTF-8 can handle it, even if we're
just talking about wordwrapping or normalization.
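
By "valid" I mean something like the following check; again this is only a
rough sketch of mine, but the idea is just to reject the 5- and 6-byte forms
(anything above 10FFFF) along with overlong sequences and surrogates:

#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available).  Returns the
 * sequence length and stores the code point in *cp, or returns 0 for
 * anything that isn't valid UTF-8: bad lead bytes, truncated or overlong
 * sequences, surrogates, and values above 0x10FFFF.  Sketch only. */
static int utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
{
    uint32_t c;
    int len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)      { *cp = s[0]; return 1; }
    else if (s[0] < 0xC2) return 0;                /* continuation or overlong lead */
    else if (s[0] < 0xE0) { c = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0) { c = s[0] & 0x0F; len = 3; }
    else if (s[0] < 0xF5) { c = s[0] & 0x07; len = 4; }
    else                  return 0;                /* leads for values beyond 10FFFF */

    if (n < (size_t)len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                              /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if ((len == 3 && c < 0x800) || (len == 4 && c < 0x10000))
        return 0;                                  /* overlong encoding */
    if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;                                  /* surrogate or out of range */
    *cp = c;
    return len;
}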

-- 
David Starner <[EMAIL PROTECTED]>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
