On Thu, 3 July 2003 at 18:50, Beni Cherniavsky wrote:

> Such programs would break anyway, with non-spacing marks and wide
> characters.

But they would at least work with ISO-8859-x files. Not all programs are 
expected to be used with all possible encodings, and using UTF-32 internally 
would not make writing any program harder. And a library must convert for 
I/O and filenames anyway; the world is not limited to UTF-8.

What are the advantages of UTF-8 besides memory usage and the ability to 
skip conversion when the external encoding happens to be UTF-8 too?

With UTF-32 I can use a function which searches for the next character index 
satisfying a predicate for many parsing-related tasks, and others like "cut 
the leading run of characters satisfying a predicate from the string". 
With UTF-8 these functions must be UTF-8-aware so they can iterate over 
characters, and they must do something with invalid sequences.
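A minimal sketch of what such predicate-based helpers look like when strings 
index code points directly (Python 3's str behaves much like a UTF-32 
representation in this respect). The function names here are illustrative, 
not from any particular library:

```python
def find_index(s: str, pred, start: int = 0) -> int:
    """Return the first character index >= start whose character
    satisfies pred, or len(s) if none does."""
    i = start
    while i < len(s) and not pred(s[i]):
        i += 1
    return i

def drop_while(s: str, pred) -> str:
    """Cut the leading run of characters satisfying pred."""
    return s[find_index(s, lambda c: not pred(c)):]

text = "   hello world"
print(find_index(text, str.isalpha))    # -> 3
print(drop_while(text, str.isspace))    # -> "hello world"
```

With raw UTF-8 byte strings the same helpers would have to decode multi-byte 
sequences as they scan, and decide what an invalid sequence means for the 
predicate.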

> You can do the right thing for the current locale by default but you
> should provide a way for the programmer to override this.

Sure.

> I think the Python approach is optimal: there are two separate kinds
> of strings, one for raw undecoded bytes and another for Unicode
> characters.  They are translated automatically but there is no way to
> accidentally make such a mistake as decoding an already decoded
> string or encoding an already encoded string.

I like the approach, except that I plan to use Unicode strings by default, 
have only Unicode string literals, and use byte arrays only for manual 
control over the conversion, for binary I/O, and for shortcuts such as 
copying data from one file to another when both have the same encoding.
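For illustration, here is what the two-kinds-of-strings model looks like in 
modern Python 3, where the split the quoted text describes became mandatory 
(in the Python of 2003, str and unicode coexisted and were converted 
implicitly). Mixing the types up raises an error instead of silently 
producing garbage:

```python
raw = b"Kowalczyk \xc5\xbc"        # undecoded bytes (UTF-8 on the wire)
text = raw.decode("utf-8")         # explicit conversion at the boundary
assert text == "Kowalczyk \u017c"  # the bytes C5 BC become one code point

# Encoding back out is equally explicit:
assert text.encode("utf-8") == raw

# The double-conversion mistakes mentioned above are ruled out by the
# types themselves: str has no .decode() and bytes has no .encode().
assert not hasattr(text, "decode")
assert not hasattr(raw, "encode")
```

The byte type still serves exactly the roles listed above: manual conversion 
control, binary I/O, and pass-through copying without a decode/encode round 
trip.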

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/