On Thu, 3 July 2003 17:52, srintuar26 wrote:

> It's thousands of times easier to use utf-8,

I will consider this. Here are the disadvantages:

- Character predicates must work on strings, and it's not obvious what part of
  the string to feed to them. For example, in a compiler/interpreter with
  UTF-32 you can write a predicate for "allowed as the first character of an
  identifier" and one for "allowed in the rest of an identifier", and take as
  many characters as satisfy the predicate. With UTF-8 it's much worse: you
  must in essence decode UTF-8 on the fly. You can't even implement "split on
  whitespace" without UTF-8 decoding, because you don't know what part of the
  string to test for being a whitespace character.

- Simple one-off programs which assume that characters are what you get when
  you index strings (programs which break paragraphs, draw ASCII tables, or
  count occurrences of characters) break more often. With UTF-8 they work only
  for ASCII, whereas with UTF-32 they also work for all text converted from
  ISO-8859-x and more. Sometimes it's better to write a program which works in
  the given setting (even though it would break for e.g. Arabic) than not be
  able to write it conveniently at all.
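To illustrate the first point, here is a minimal sketch in Python, whose str type indexes by code point (the same situation as the UTF-32 representation discussed above). The predicate names and the identifier rules are hypothetical, not from any particular language, but the scanning loop shows the "take as many characters as satisfy the predicate" pattern that byte-indexed UTF-8 strings make awkward:

```python
import unicodedata

def is_ident_start(ch):
    # Hypothetical rule: a letter or underscore may start an identifier.
    return ch == "_" or unicodedata.category(ch).startswith("L")

def is_ident_cont(ch):
    # Hypothetical rule: letters, decimal digits, underscore may continue it.
    return is_ident_start(ch) or unicodedata.category(ch) == "Nd"

def scan_identifier(text, pos):
    """Take as many characters as satisfy the predicates, starting at pos.

    Returns (identifier, new_position), or (None, pos) if no identifier
    starts here. Indexing text[i] yields whole code points, so the
    predicates receive exactly one character each.
    """
    if pos >= len(text) or not is_ident_start(text[pos]):
        return None, pos
    end = pos + 1
    while end < len(text) and is_ident_cont(text[end]):
        end += 1
    return text[pos:end], end

# Works for non-ASCII identifiers with no explicit decoding step:
print(scan_identifier("żółw = 1", 0))  # → ('żółw', 4)
```

With a byte-oriented UTF-8 string the same loop would have to decode a multibyte sequence before each predicate call, exactly the on-the-fly decoding complained about above.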

> I would generally advise against doing what perl does: the decision to
> automatically consider all I/O to be text mode with automatic clumsy
> conversion has been a massive bug-generator. It's better to treat I/O as
> binary, and let the user call a conversion function if desired.

No, this would be awful, because users won't remember to do the conversion, or 
won't do the right one. Half of the program would use UTF-8, half would use 
the encoding of the file, and things would break as soon as they exchange 
non-ASCII data.

I don't know exactly what it should look like, or what Perl does, but it's 
definitely not an option to pretend that UTF-8 is used internally and at the 
same time deliver file contents as unconverted strings. That would in essence 
recreate the 3rd option, i.e. the internal encoding mess we have today, with 
no reliable non-ASCII character properties at all.

I can't imagine anything better than attaching encodings to text file objects.
Binary I/O will transfer arrays of bytes; text I/O will transfer Unicode 
strings (with an as yet undecided internal representation) or arrays of bytes 
(when you just want to send them elsewhere without recoding twice).
Arrays of bytes should be clearly distinguished from strings.
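For what it's worth, Python's io module eventually settled on essentially this design: a binary stream carries bytes, and a text layer with an attached encoding wraps it and yields Unicode strings. A small sketch of the distinction (using an in-memory buffer in place of a real file):

```python
import io

# Binary I/O: an array of bytes, no conversion applied.
raw = io.BytesIO("zażółć".encode("iso-8859-2"))
assert isinstance(raw.getvalue(), bytes)

# Text I/O: the encoding is attached to the file object itself,
# so every read decodes and every write encodes transparently.
text = io.TextIOWrapper(raw, encoding="iso-8859-2")
s = text.read()
print(type(s).__name__, s)  # str zażółć
```

The bytes/str split here is also the "arrays of bytes clearly distinguished from strings" requirement: the two types don't mix, so a forgotten conversion fails loudly instead of silently propagating the wrong encoding.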

> Another suggestion for your language: keep the source code in the same
> encoding as your strings. If you want a UTF-32 language, then require
> all source code to be encoded in that same encoding as well.

It makes no sense, because it would require a Unicode-capable editor even for 
programs which use only ASCII. The source will have to be recodable at least 
from UTF-8 and ISO-8859-x.

> Allow identifiers, comments, and literals to all be in that encoding so you
> don't get stuck in C++'s situation, where you have to use English for
> almost everything.

This is a separate issue. I'm writing the compiler in the language itself (I 
already have an interpreter for a very large subset of the language to 
bootstrap it - the interpreter lacks only libraries, and it was only an 
intermediate step to experiment with the language and to be able to bootstrap 
the compiler). Once the compiler can bootstrap itself and the recoding 
framework is done, the source can be read and converted like any other text file.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/