Re: Strings in a programming language

Wu Yongwei Thu, 03 Jul 2003 19:28:27 -0700

Marcin 'Qrczak' Kowalczyk wrote:

> I'm pretty sure that all Windows versions since Win95 allow to
> convert between UTF-16 and the encoding used for filenames, and WinNT
> allows to use UTF-16 filenames directly. So for filenames and the
> like, texts on screen, texts exchanged with databases etc. any
> Unicode encoding can be made working, with some advantage for UTF-16.
> I consider making this the default for Windows because Microsoft
> already seems to do so. I hate UTF-16 personally but in this case I
> can blame Microsoft for making harder to process characters beyond
> U+FFFF. It's not harder than UTF-8 anyway.

I wonder how many people really want to use Unicode code points beyond
U+FFFF.  Personally I think the Microsoft way is OK, and allocating 32
bits for a wide character is overkill.  Using UTF-16 is already a waste
of memory for Westerners (compared to UTF-8, where ASCII takes one byte
per character), and using UTF-32 (and, this time, UTF-8 as well, since
most CJK characters take three bytes there against two in UTF-16) is a
waste for Asians (don't say memory is not an issue: memory, IMHO, is
always an issue).  I would simply choose to implement a 16-bit Unicode
subset, without the complexities of UTF-16 surrogate pairs.
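
To put rough numbers on the memory argument, here is a minimal sketch in C
that returns how many bytes one code point occupies in each encoding form.
The function names are my own, chosen for illustration, and the sketch
does not reject the surrogate range U+D800..U+DFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8 (1 to 4). */
size_t utf8_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);          /* outside this range is not Unicode */
    if (cp < 0x80)    return 1;      /* ASCII */
    if (cp < 0x800)   return 2;      /* Latin supplements, Greek, Cyrillic... */
    if (cp < 0x10000) return 3;      /* rest of the BMP, including most CJK */
    return 4;                        /* beyond U+FFFF */
}

/* Bytes needed in UTF-16: 2 inside the BMP, 4 for a surrogate pair. */
size_t utf16_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

/* UTF-32 always spends 4 bytes per code point. */
size_t utf32_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return 4;
}
```

For 'A' (U+0041) this gives 1, 2 and 4 bytes; for U+4E2D it gives 3, 2
and 4 bytes, which is the trade-off described above.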

However, UTF-16 is harder to process than UTF-8 (I think you are wrong
on this point).  With UTF-8 you are unlikely to split in the middle of a
sequence by mistake, because lead bytes and continuation bytes occupy
distinct ranges and a bad split is easy to detect; with UTF-16 it is much
more likely to happen.  Similarly, it is safe to search for a valid
UTF-8 character in a UTF-8 sequence using C functions like strstr: a
match can never begin or end in the middle of another character.  This
is a great advantage.
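
A small sketch of both properties in C, assuming the input is valid
UTF-8.  The helper names (utf8_is_continuation, utf8_align, utf8_find)
are hypothetical; only strstr comes from the C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Move pos back to the first byte of the sequence that contains it,
 * so a split at the returned offset never cuts a character in half. */
size_t utf8_align(const char *s, size_t pos)
{
    assert(s != NULL);
    while (pos > 0 && utf8_is_continuation((unsigned char)s[pos]))
        pos--;
    return pos;
}

/* Byte offset of the first occurrence of needle, or -1 if absent.
 * Plain strstr is enough: a valid UTF-8 needle cannot match starting
 * at a continuation byte of the haystack. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *hit;
    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

The alignment loop looks at no more than three continuation bytes on
valid input, so fixing a split point costs almost nothing.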

> It's not simpler, it's the same. Since the language is not C, it
> doesn't matter that C provides some functions working on text - they
> won't be used anyway because they break on '\0' characters (and ANSI
> C doesn't provide anything interesting).

How can it be the same?  Don't you use the C library too?  Don't you
want to reuse it and make the compiled code smaller?

'\0' is not a valid character; it is a byte.  Don't confuse the two.  If
you want to use UTF-32 or UTF-16 internally, you must have a separate
implementation for byte arrays.  '\0' should appear only there.
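
As a rough illustration of why byte arrays need their own
implementation, the sketch below uses a hypothetical struct byte_array
that stores its length explicitly, so zero bytes are ordinary data.
Fed the UTF-16LE bytes of "Hi" (48 00 69 00), a '\0'-terminated view
would stop after the first byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical byte-array type: the length is stored explicitly
 * instead of being implied by a terminating '\0'. */
struct byte_array {
    const unsigned char *data;
    size_t len;
};

/* Count the zero bytes; in a byte array they are just data. */
size_t count_zero_bytes(struct byte_array a)
{
    size_t n = 0;
    size_t i;
    assert(a.data != NULL);
    for (i = 0; i < a.len; i++)
        if (a.data[i] == 0)
            n++;
    return n;
}

/* Length a '\0'-terminated C string function would see:
 * everything before the first zero byte, or the whole array. */
size_t c_string_view_len(struct byte_array a)
{
    const unsigned char *z;
    assert(a.data != NULL);
    z = memchr(a.data, 0, a.len);
    return z ? (size_t)(z - a.data) : a.len;
}
```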

Best regards,

Wu Yongwei


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
