On Thursday, 3 July 2003 at 17:52, srintuar26 wrote:
> It's thousands of times easier to use UTF-8,
I will consider this. Here are the disadvantages:
- Character predicates must work on strings, and it's not obvious what part of
the string to feed them. For example, in a compiler/interpreter with
UTF-32 you can write a predicate for "allowed as the first character of an
identifier" and one for "allowed in the rest of an identifier", and take as
many characters as satisfy the predicate. With UTF-8 it's much worse: you
must in essence decode UTF-8 on the fly. You can't even implement "split on
whitespace" without UTF-8 decoding, because you don't know what part of the
string to test for being a whitespace character. (See the sketch after this
list.)
- Simple one-time-use programs which assume that characters are what you get
when you index strings (programs which break paragraphs, draw ASCII tables or
count occurrences of characters) break more often. They work only for ASCII,
whereas with UTF-32 they work for all text converted from ISO-8859-x and
more. Sometimes it's better to write a program which works in the given
setting (even though it would break for e.g. Arabic) than not to be able to
write it conveniently at all.
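
To make the first point concrete, here is a rough sketch in Python 3 (not the
language in question; the predicates and the sample text are invented for
illustration) of taking an identifier by testing code points directly, which
raw UTF-8 byte indexing cannot do without decoding first:

import unicodedata

def is_ident_start(ch):
    # Hypothetical predicate: a letter or underscore may start an identifier.
    return ch == "_" or unicodedata.category(ch).startswith("L")

def is_ident_rest(ch):
    # Hypothetical predicate: letters, digits and underscore may continue it.
    return is_ident_start(ch) or unicodedata.category(ch).startswith("N")

def scan_identifier(s, i):
    # Take as many code points as satisfy the predicates, starting at index i.
    if i >= len(s) or not is_ident_start(s[i]):
        return None, i
    j = i + 1
    while j < len(s) and is_ident_rest(s[j]):
        j += 1
    return s[i:j], j

print(scan_identifier("żółw = 1", 0))   # ('żółw', 4): four code points taken

# With raw UTF-8 the same indexing is meaningless:
# "żółw = 1".encode("utf-8")[0] is 197, half of the first character,
# so no character predicate can be applied to it without decoding.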
> I would generally advise against doing what perl does: the decision to
> automatically consider all I/O to be text mode with automatic clumsy
> conversion has been a massive bug-generator. It's better to treat I/O as
> binary, and let the user call a conversion function if desired.
No, this would be awful, because users won't remember to do the conversion or
won't do the right one. Half of the program would use UTF-8, half would use
the encoding of the file, and things would break when they exchange non-ASCII
data.
I don't know how it should look or what Perl does, but it's definitely
not an option to pretend that UTF-8 is used internally and at the same time
deliver file contents as unconverted strings. That would in essence be the
third option, i.e. the internal encoding mess we have today, with no
reliable non-ASCII character properties at all.
I can't imagine anything better than attaching encodings to text file objects.
Binary I/O will transfer arrays of bytes; text I/O will transfer Unicode
strings (with an as yet undecided internal representation) or arrays of bytes
(when you just want to send them elsewhere without recoding twice).
Arrays of bytes should be clearly distinguished from strings.
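
For comparison only (this is Python 3's I/O layer, not the design proposed
here, and the file name is invented), this is roughly the shape I mean: the
encoding is a property of the text file object, binary files hand back arrays
of bytes, and the two result types cannot be mixed silently.

# Text I/O: reading recodes from the file's declared encoding into
# Unicode strings, whatever the internal representation is.
with open("input.txt", "r", encoding="iso-8859-2") as f:
    text = f.read()
    assert isinstance(text, str)    # a Unicode string, not bytes

# Binary I/O: no recoding, the bytes pass through untouched.
with open("input.txt", "rb") as f:
    raw = f.read()
    assert isinstance(raw, bytes)   # an array of bytes, not a string

# Because the two types are distinct, "half of the program uses UTF-8 and
# half uses the encoding of the file" fails loudly instead of corrupting
# data: text + raw raises a TypeError.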
> Another suggestion for your language: keep the source code in the same
> encoding as your strings. If you want a UTF-32 language, then require
> all source code to be encoded in that same encoding as well.
It makes no sense, because it would require a Unicode-capable editor even for
programs which use only ASCII. The source will have to be recodable at least
from UTF-8 and ISO-8859-x.
> Allow identifiers, comments, and literals to all be in that encoding so you
> don't get stuck in C++'s situation, where you have to use English for
> almost everything.
This is a separate thing. I'm writing the compiler in the language itself (I
already have an interpreter for a very large subset of the language to
bootstrap it; the interpreter lacks only libraries and was only an
intermediate step to experiment with the language and to be able to bootstrap
the compiler). When the compiler is able to bootstrap itself and the recoding
framework is done, the source can be read and converted like other text files.
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/