I'm pretty sure that all Windows versions since Win95 can convert between UTF-16 and the encoding used for filenames, and WinNT can use UTF-16 filenames directly. So for filenames and the like, text on screen, text exchanged with databases etc., any Unicode encoding can be made to work, with some advantage for UTF-16. I am considering making this the default for Windows because Microsoft already seems to do so. I hate UTF-16 personally, but in this case I can blame Microsoft for making it harder to process characters beyond U+FFFF. It's not harder than UTF-8 anyway.
I wonder how many people really want to use Unicode code points beyond U+FFFF. Personally I think the Microsoft way is OK, and allocating 32 bits for a wide character is overkill. Using UTF-16 is already a waste of memory for Westerners (compared to UTF-8), and using UTF-32 (and, in this case, UTF-8 as well) is a waste for Asians (don't say memory is not an issue: memory, IMHO, is always an issue). I would simply choose to implement a 16-bit Unicode subset, without dealing with the complexities of UTF-16.
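To put numbers on the memory argument, here is a minimal sketch in C that computes how many bytes one code point takes in each encoding. The function names are made up for this illustration and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers: bytes needed to store one code point.
   They return 0 for surrogates and for values above U+10FFFF. */
size_t utf8_len(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
    if (cp < 0x80)      return 1;                /* ASCII */
    if (cp < 0x800)     return 2;                /* Latin, Greek, Cyrillic... */
    if (cp < 0x10000)   return 3;                /* rest of the BMP, incl. most CJK */
    if (cp <= 0x10FFFF) return 4;                /* beyond U+FFFF */
    return 0;
}

size_t utf16_len(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000)   return 2;                /* one 16-bit unit */
    if (cp <= 0x10FFFF) return 4;                /* surrogate pair */
    return 0;
}

size_t utf32_len(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    return cp <= 0x10FFFF ? 4 : 0;               /* always one 32-bit unit */
}

/* A few spot checks of the sizes discussed above. */
void size_examples(void)
{
    assert(utf8_len(0x41) == 1   && utf16_len(0x41) == 2);    /* 'A' */
    assert(utf8_len(0x4E2D) == 3 && utf16_len(0x4E2D) == 2);  /* a CJK ideograph */
    assert(utf32_len(0x41) == 4  && utf32_len(0x4E2D) == 4);
}
```

So ASCII text doubles in size in UTF-16, while a BMP ideograph shrinks from three bytes to two. Only code points beyond U+FFFF need the surrogate pair that a plain 16-bit subset would leave out.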
However, UTF-16 is harder to process than UTF-8 (I think you are wrong on this point). In UTF-8, continuation bytes can never be mistaken for the start of a character, so it is easy to avoid splitting in the middle of a sequence by mistake (the encoding guarantees it); with UTF-16, where surrogate pairs are rare and easily forgotten, such a split is much more likely to happen. Similarly, it is safe to search for a valid UTF-8 character in a UTF-8 string using C functions like strstr. This is a great advantage.
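A minimal sketch of both points in C, assuming the input is valid UTF-8. The helper names are invented for this example; only strstr comes from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A UTF-8 continuation byte always has the bit pattern 10xxxxxx. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: move a byte offset back to the start of the
   character it points into, so a split never lands mid-sequence. */
size_t utf8_align_back(const char *s, size_t pos)
{
    assert(s != NULL);
    while (pos > 0 && utf8_is_continuation((unsigned char)s[pos]))
        pos--;
    return pos;
}

/* Hypothetical helper: byte offset of `needle` in `haystack`, or -1.
   Plain strstr is enough: in valid UTF-8 the encoding of one character
   never matches starting in the middle of another character. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *p;
    assert(haystack != NULL && needle != NULL);
    p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1L;
}
```

For example, with the bytes 61 C3 A9 E4 B8 AD ("a", U+00E9, U+4E2D), utf8_align_back at offset 5 steps back to offset 3, the lead byte of the last character, and utf8_find for E4 B8 AD also returns 3.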
It's not simpler, it's the same. Since the language is not C, it doesn't matter that C provides some functions that work on text - they won't be used anyway because they break on '\0' characters (and ANSI C doesn't provide anything interesting).
How can it be the same? Don't you use the C library too? Don't you want to reuse it and make the compiled code smaller?
'\0' is not a valid character; it is a byte. Don't confuse them. If you want to use UTF-32/16 internally, you must have a separate implementation for byte arrays. '\0' should appear only there.
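A small sketch of why the byte-oriented C functions cannot simply be pointed at UTF-16 data: ordinary ASCII letters already contain zero bytes there. The array and the helper below are illustrative names only, and little-endian byte order is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "AB" in UTF-16LE followed by a 16-bit zero terminator: 41 00 42 00 00 00 */
const char utf16le_ab[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };

/* Hypothetical helper: count 16-bit units up to the 16-bit terminator. */
size_t utf16le_units(const char *bytes)
{
    size_t n = 0;
    assert(bytes != NULL);
    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0)
        n++;
    return n;
}
```

Here strlen(utf16le_ab) stops at the zero byte inside 'A' and reports 1, while utf16le_units(utf16le_ab) reports 2. In UTF-8 a zero byte never occurs inside the encoding of any other character, so the same text passes through the byte functions unchanged.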
Best regards,
Wu Yongwei
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
