>This natural 2-byte encoding is called UCS-2. Note that UCS-2 also
>contains \0 (embedded 0x00 bytes), and hence is not a C/C++ string.

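The embedded-NUL point is easy to check. Below is a minimal Python sketch, not anyone's actual implementation. Assumptions: Python has no separate UCS-2 codec, so "utf-16-be" stands in for big-endian UCS-2 (the bytes are identical for BMP-only text), and c_strlen is a hypothetical helper that mimics how a C routine stops at the first 0x00 byte.

```python
# Sketch: compare how a NUL-terminated-string routine would see the same
# text in a 2-byte encoding versus UTF-8.
# Assumption: "utf-16-be" stands in for big-endian UCS-2 (same bytes for BMP text).

def c_strlen(buf: bytes) -> int:
    """Hypothetical helper: length up to the first 0x00 byte, like C strlen()."""
    pos = buf.find(b"\x00")
    return len(buf) if pos == -1 else pos

text = "txt"
ucs2_bytes = text.encode("utf-16-be")   # each ASCII char becomes 0x00 + the ASCII byte
utf8_bytes = text.encode("utf-8")       # ASCII stays one byte per character

print(ucs2_bytes)             # b'\x00t\x00x\x00t'
print(c_strlen(ucs2_bytes))   # 0: the very first byte is 0x00
print(c_strlen(utf8_bytes))   # 3: no embedded 0x00 bytes
```

With the 2-byte form, a byte-oriented C function sees an empty string, which is why such data needs wide-character APIs rather than char* routines.
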
What about UTF-16? Most UCS-2 systems now kludge in UTF-16, and can handle the full Unicode space to varying degrees.

>Various encodings of Unicode values which are more space-efficient
>than UCS-4 have been designed. The best is definitely UTF-8, which
>is a self-synchronizing multi-byte encoding.

There is no encoding of Unicode that is definitely the best. UTF-16 is more space-efficient than UTF-8 for CJK scripts. SCSU (which is out of the scope of this note) is more space-efficient than UTF-16 and UTF-8 for just about every string, but is much more complex.

>UTF-8 is a C/C++ string.

UTF-8 is a C-string-compatible encoding.

>Unlike UCS-2, UTF-8 can also encode the entire 31-bit Unicode space.

It's a 20.1-bit space. And UTF-16 (which has mostly supplanted UCS-2, even if the UTF-16 support is still a bit beta) can encode the same 20.1-bit space. The values UTF-16 can't encode aren't going to be assigned to Unicode characters.

>The issue with [ FS ] and [ Java/JVM ] is that both are naturally
>Unicode-based.

I have no idea what FS is; it doesn't sound like any filesystem I'm familiar with.

>[ FS ], like VFAT, understands the C/C++ string ":30b9:30c3:2022txt"
>to mean the above six Unicode characters. That is, [ FS ] uses another
>multi-byte encoding, which it calls the "Linux encoding", to allow use
>of almost the full Unicode set U+0001 to U+FFFF available for filenames.

It shouldn't call it the Linux encoding; for most purposes, the Linux Unicode encoding is UTF-8, notably including filenames. Most of us on this list have UTF-8 named files, if only for the sake of making sure they work. Your filesystem is broken; the proper thing to do (albeit not necessarily easy or practical) is to replace it with a POSIX filesystem (which can be swapped in transparently, for most purposes, and which handles UTF-8 transparently).

>The Linux encoding used by [ FS ] is a hack. It is not as space-efficient
>as UTF-8, nor is it a de jure (official) standard.

It's not a de facto standard, either. In the several years I've been on this list and [EMAIL PROTECTED], I've never heard of it.

>The Linux encoding is used because the system is not ready for UTF-8.

What is "the system" here? It sure as heck isn't Linux. Again I object to the misnomer "Linux Encoding".

>The Linux encoding only uses US-ASCII, unlike UTF-8.

You just explained how it encoded U+0080-U+00FF as bytes. So it doesn't use US-ASCII, it uses Latin-1; according to your description, it only works right under Latin-1.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
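
P.S. For anyone who wants to check the size and range figures in this message, here is a rough Python sketch. It is an illustration under stated assumptions, not a benchmark: encoded_sizes is a hypothetical helper, the sample text is the two katakana (U+30B9, U+30C3) from the quoted filename, and "utf-16-be" is used so that no byte-order mark is counted.

```python
import math

def encoded_sizes(text: str) -> tuple[int, int]:
    """Hypothetical helper: (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

cjk = "\u30b9\u30c3"       # two katakana from the quoted filename example
ascii_text = "txt"
beyond_bmp = "\U00010000"  # first code point outside the BMP

print(encoded_sizes(cjk))         # (6, 4): UTF-16 is smaller for this text
print(encoded_sizes(ascii_text))  # (3, 6): UTF-8 is smaller for ASCII
print(encoded_sizes(beyond_bmp))  # (4, 4): UTF-16 uses a surrogate pair

# Code points run from U+0000 to U+10FFFF, i.e. 0x110000 values.
bits = math.log2(0x110000)
print(round(bits, 2))             # 20.09, which rounds to the 20.1 quoted above
```

Which encoding is smaller depends entirely on the text, which is the reason for saying that no encoding is definitely the best.
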
