Re: ``A Short Into ...'' - comments, suggestions?

starner Wed, 11 Dec 2002 22:31:06 -0800

>This natural 2-byte encoding is called UCS-2.  Note that UCS-2 also
>contains \0 (embedded 0x00 bytes), and hence is not a C/C++ string.


What about UTF-16? Most UCS-2 systems now kludge in UTF-16, and can
handle the full Unicode space to varying degrees.

>Various encodings of Unicode values which are more space-efficient
>than UCS-4 have been designed.  The best is definitely UTF-8, which
>is a self-synchronizing multi-byte encoding. 

There is no encoding of Unicode that is definitely the best. UTF-16
is more space-efficient than UTF-8 for CJK scripts. SCSU (which is
outside the scope of this note) is more space-efficient than UTF-16
and UTF-8 for just about every string, but is much more complex.
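
A quick way to check the CJK claim is to encode the same text both ways
and count bytes. This is only a sketch; the sample strings and the helper
name encoded_sizes are made up for illustration, and UTF-16 is measured
without a byte order mark.

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16 (no BOM).
# The sample strings are illustrative assumptions, not from the original note.
samples = {
    "ASCII": "Hello, world",
    "CJK": "日本語のテキスト",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{label}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

BMP characters at U+0800 and above, which includes the common CJK
characters, take three bytes in UTF-8 but two in UTF-16, while ASCII
takes one byte in UTF-8 and two in UTF-16. Which encoding is smaller
depends on the text.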

>UTF-8 is a C/C++ string.

UTF-8 is a C-string-compatible encoding.
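
To make the distinction concrete, here is a small sketch that looks for
0x00 bytes in encoded text. The helper has_nul_byte is hypothetical.
Python has no separate UCS-2 codec, so UTF-16 stands in for it here; the
two produce the same bytes for BMP characters.

```python
def has_nul_byte(text, encoding):
    """Return True if encoding the text yields at least one 0x00 byte."""
    return 0 in text.encode(encoding)


name = "txt"
# UTF-16/UCS-2 pads ASCII characters with a 0x00 byte each.
print(has_nul_byte(name, "utf-16-be"))
# UTF-8 emits a 0x00 byte only for U+0000 itself.
print(has_nul_byte(name, "utf-8"))
```

So a UTF-8 byte sequence survives NUL-terminated C string handling as
long as the text contains no U+0000, which is why "compatible encoding"
is the more accurate wording.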

>Unlike UCS-2, UTF-8 can also encode the entire 31-bit Unicode space.

20.1-bit space. And UTF-16 (which has mostly supplanted UCS-2, even
if the UTF-16 support is still a bit beta) can encode the same 20.1-bit
space. The code points UTF-16 can't reach aren't going to be
assigned to Unicode characters.
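
For anyone wondering where 20.1 comes from: the code space runs from
U+0000 to U+10FFFF, which is 17 planes of 65,536 code points. The sketch
below computes the bit count and shows how UTF-16 reaches the top of
that range with a surrogate pair. The function name to_surrogates is
just an illustrative choice.

```python
import math

# U+0000 .. U+10FFFF: 17 planes of 65,536 code points each.
CODE_POINTS = 0x110000
bits = math.log2(CODE_POINTS)
print(round(bits, 1))


def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                 # 20-bit offset into planes 1..16
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogates(0x10FFFF)
print(hex(high), hex(low))
```

The pair carries 20 bits, giving 16 supplementary planes on top of the
BMP, so U+10FFFF is the last code point UTF-16 can address.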

>The issue with [ FS ] and [ Java/JVM ] is that both are naturally
>Unicode-based.

I have no idea what FS is; it doesn't sound like any filesystem I'm
familiar with.

>[ FS ], like VFAT, understands the C/C++ string ":30b9:30c3:2022txt"
>to mean the above six Unicode characters.  That is, [ FS ] uses another
>multi-byte encoding, which it calls the "Linux encoding", to allow use
>of almost the full Unicode set U+0001 to U+FFFF available for filenames.

It shouldn't call it the Linux encoding; for most purposes, the Linux
Unicode encoding is UTF-8, notably including filenames. Most of us
on this list have UTF-8 named files, if only for the sake of making
sure they work. Your filesystem is broken; the proper thing to do
(albeit not necessarily easy or practical) is to replace it with
a POSIX filesystem (which can be swapped in transparently, for most
purposes, and which handles UTF-8 transparently).

>The Linux encoding used by [ FS ] is a hack.  It is not as space-
>efficient as UTF-8, nor is it a de jure (official) standard.

It's not a de facto standard, either. In the several years I've
been on this list and [EMAIL PROTECTED], I've never heard of
it.

>The Linux encoding is used because the system is not ready for UTF-8.

What is "the system" here? It sure as heck isn't Linux. Again I object
to the misnomer "Linux Encoding".

>The Linux encoding only uses US-ASCII, unlike UTF-8.

You just explained how it encoded U+0080-U+00FF as bytes. So it doesn't use
US-ASCII, it uses Latin-1; according to your description, it only works
right under Latin-1.
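
A short sketch of why such bytes are Latin-1 and not US-ASCII. It
assumes, going only by the description above, that a character such as
U+00E9 is stored as the single byte 0xE9; the helper decodes_as is
hypothetical.

```python
# Assumption: the scheme stores U+00E9 as the single raw byte 0xE9.
raw = bytes([0xE9])


def decodes_as(data, encoding):
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


for enc in ("ascii", "utf-8", "latin-1"):
    print(enc, decodes_as(raw, enc))
```

Only the Latin-1 decode succeeds. US-ASCII stops at 0x7F, and a lone
0xE9 is a truncated multi-byte sequence in UTF-8, so a tool that assumes
either of those will reject or garble the filename.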

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
