On Thursday, 3 July 2003 18:50, Beni Cherniavsky wrote:
> Such programs would break anyway, with non-spacing marks and wide
> characters.
But they would at least work with ISO-8859-x files. Not all programs are
expected to handle all possible encodings, and using UTF-32 internally
would not make writing any program harder. A library must convert for
I/O and filenames anyway; the world is not limited to UTF-8.
What are the advantages of UTF-8, besides memory usage and the ability to
skip the conversion when the external encoding happens to be UTF-8 too?
With UTF-32 I can use a function which searches for the next character index
satisfying a predicate for many parsing-related tasks, and others like "cut
the beginning of the string composed of characters satisfying a predicate".
With UTF-8 these functions must be UTF-8-aware so they can iterate over
characters, and they must do something with invalid sequences.
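To illustrate the point above, here is a minimal sketch (in Python, whose
str type gives the same fixed-width code-point view as UTF-32; the helper
names find_index and skip_while are mine, not from any library):

```python
# With a sequence of code points, "find the next index whose character
# satisfies a predicate" is a plain indexed loop -- no decoding needed.

def find_index(s, pred, start=0):
    """Return the first index >= start whose character satisfies pred,
    or len(s) if none does."""
    i = start
    while i < len(s) and not pred(s[i]):
        i += 1
    return i

def skip_while(s, pred):
    """Cut the longest prefix whose characters all satisfy pred."""
    return s[find_index(s, lambda c: not pred(c)):]

text = "   żółć rest"
print(skip_while(text, str.isspace))   # -> "żółć rest"

# Over raw UTF-8 bytes the same indexed loop would be wrong: an index can
# land in the middle of a multi-byte sequence, so a scanner must decode as
# it goes and decide what to do with invalid sequences.
raw = text.encode("utf-8")
print(len(text), len(raw))             # 12 code points vs 16 bytes
```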
> You can do the right thing for the current locale by default but you
> should provide a way for the programmer to override this.
Sure.
> I think the Python approach is optimal: there are two separate kinds
> of strings, one for raw undecoded bytes and another for Unicode
> characters. They are translated automatically but there is no way to
> incidentally make such a mistake as decoding the result of a decoded
> string or encoding an already encoded string.
I like the approach, except that I plan to use Unicode strings by default,
have only Unicode string literals, and use byte arrays only for manual
control over the conversion, for binary I/O, and for shortcuts when copying
data from one file to another when both have the same encoding.
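The two-kinds-of-strings discipline described above later became the default
in Python 3, so it can be sketched directly in Python 3 terms (at the time
of this post, Python 2's str/unicode pair still allowed implicit coercion):

```python
# str holds Unicode characters; bytes holds raw undecoded octets.
# Conversion between the two is always explicit.

data = "żółć".encode("utf-8")     # explicit encode: str -> bytes
text = data.decode("utf-8")       # explicit decode: bytes -> str
assert text == "żółć"

# Encoding an already-encoded string is a type-level mistake:
# bytes has no .encode method, so the error cannot happen silently.
assert not hasattr(data, "encode")

# Mixing the two kinds is rejected outright.
try:
    "abc" + b"abc"
except TypeError:
    print("str and bytes do not concatenate")
```

Byte arrays then remain available for binary I/O and for copying data
between files verbatim, exactly the uses listed above.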
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/