Re: Strings in a programming language

srintuar26 Thu, 03 Jul 2003 08:49:03 -0700


> - Are there other reasonable ways?
> - Is it good to use UTF-32 on Unix and UTF-16 on Windows?


I'd advise against it. There is really no advantage to using UTF-32
anymore. It won't tell you where it's safe to break strings, you can still
have invalid sequences, some characters will still require multiple
code points, and you will most certainly still need byte streams for I/O
and serialization. You'll possibly also have to deal with endianness
problems if you want portability.
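
A minimal sketch of three of those points, written in Python purely for
illustration (the variable names are made up for this example):

```python
import unicodedata

# "e" followed by a combining acute accent: one character to the reader,
# but two code points, so UTF-32 does not give one unit per character.
decomposed = "e\u0301"
assert len(decomposed) == 2
assert len(decomposed.encode("utf-32-le")) == 8  # 2 code points * 4 bytes

# NFC happens to have a precomposed form for this pair; many combinations
# have none, so normalization does not remove the problem in general.
composed = unicodedata.normalize("NFC", decomposed)

# The same code point serializes differently depending on byte order.
little = "A".encode("utf-32-le")
big = "A".encode("utf-32-be")

# A 32-bit value above U+10FFFF is not a code point, so UTF-32 data
# can still be invalid and must be checked when decoding.
bad = (0x110000).to_bytes(4, "little")
try:
    bad.decode("utf-32-le")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```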

It's vastly easier to use UTF-8, even on Windows, IMO. Most of the time
you can just ignore the fact that it's in UTF-8 and treat it as an
old-fashioned ASCII string. When you need to do one of the operations
that require special handling, it's no more work than you'd have to do
for UTF-32.
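
As a rough illustration of the "treat it as ASCII" point, again in Python
and with a made-up path, splitting on an ASCII delimiter works directly on
the UTF-8 bytes:

```python
# UTF-8 bytes for a path with non-ASCII components.
path = "home/josé/日本語/file.txt".encode("utf-8")

# Plain byte-level split on an ASCII delimiter; nothing is decoded.
parts = path.split(b"/")

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so an ASCII byte
# such as "/" can never show up in the middle of a character.
decoded = [part.decode("utf-8") for part in parts]

# The special handling comes in when you count: bytes and code points
# differ as soon as a non-ASCII character is present.
byte_len = len(parts[1])        # "josé" as UTF-8 bytes
codepoint_len = len(decoded[1]) # "josé" as code points
```

The same caveat applies to UTF-32 for anything beyond code points, such as
combining sequences, so the counting step is not extra work specific to
UTF-8.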

I would generally advise against doing what Perl does: the decision to
automatically treat all I/O as text mode, with clumsy automatic
conversion, has been a massive bug generator. It's better to treat I/O
as binary and let the user call a conversion function if desired.
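
One way this could look, sketched in Python. The helper names `read_raw`
and `to_text` are hypothetical, invented only for this example; they are
not taken from any existing language or library:

```python
import io

def read_raw(stream):
    """Return the stream contents as bytes, with no conversion at all."""
    return stream.read()

def to_text(data, encoding="utf-8"):
    """Explicit conversion step; the caller chooses if and when to decode."""
    return data.decode(encoding)

# An in-memory stand-in for a file opened in binary mode.
stream = io.BytesIO("naïve\r\n".encode("utf-8"))

raw = read_raw(stream)  # bytes exactly as stored, line ending included
text = to_text(raw)     # decoded only because the caller asked for it
```

With this split, binary data passes through untouched, and a decoding
error can only occur at the point where the caller explicitly asks for
text.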

Another suggestion for your language: keep the source code in the same
encoding as your strings. If you want a UTF-32 language, then require
all source code to be in UTF-32 as well. Allow identifiers, comments,
and literals to all be in that encoding so you don't get stuck in C++'s
situation, where you have to use English for almost everything.

(I personally think C++ should allow UTF-8 everywhere, with no
compiler-enforced normalization form whatsoever.)

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
