On Thursday, 3 July 2003 at 20:48, Bruno Haible wrote:

> You find details here:
[...]

Thanks! This is what I was looking for.

> Since strings are immutable in your language, you can also represent
> strings as UCS-2 or ISO-8859-1 if possible; this saves 75% of the memory
> in many cases, at the cost of a little more expensive element access.

I was thinking about it. Is it worth the effort? For example, when converting 
a string read from a file, I won't know beforehand whether it will fit 
(except when the encoding is ISO-8859-1), so I would have to try converting 
to ISO-8859-1 first and switch to UTF-32 as soon as a character above U+00FF 
is encountered.
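
For concreteness, the widening can happen inside the conversion loop itself, 
so the input is still scanned only once. A rough sketch in C (the String 
layout and all the names are mine, and I assume the input is already decoded 
into code points; this is not from any existing runtime):

    #include <stdint.h>
    #include <stdlib.h>

    /* Two payloads: narrow (ISO-8859-1) or wide (UTF-32). */
    typedef struct {
        int wide;        /* 0: bytes[] is ISO-8859-1, 1: chars[] is UTF-32 */
        size_t len;
        uint8_t  *bytes;
        uint32_t *chars;
    } String;

    /* Build a string from decoded code points: start narrow, widen once
       on the first code point above U+00FF.  Error handling omitted. */
    String build_string(const uint32_t *cp, size_t n)
    {
        String s = { 0, n, malloc(n), NULL };
        for (size_t i = 0; i < n; i++) {
            if (!s.wide && cp[i] > 0xFF) {
                /* Widen: promote the prefix converted so far to UTF-32. */
                s.chars = malloc(n * sizeof *s.chars);
                for (size_t j = 0; j < i; j++)
                    s.chars[j] = s.bytes[j];
                free(s.bytes);
                s.bytes = NULL;
                s.wide = 1;
            }
            if (s.wide)
                s.chars[i] = cp[i];
            else
                s.bytes[i] = (uint8_t)cp[i];
        }
        return s;
    }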

So many complications only to save some space... Perhaps it will be enough 
if people are able to work on unconverted byte strings when they really 
need to.

> Why not using UTF-32 as internal representation on Windows as well?

Well, ok :-)

> Unless you provide some built-in language constructs for safely iterating
> across a string, like
>
>       for (c across-string: str) statement
>
> this would be too cumbersome for the user who is not aware of i18n.

Indeed, having been persuaded here to use UTF-8, I was trying to imagine 
what it would look like. Unfortunately, executing a statement for every 
character (which would be very easy to provide) is not enough...

The preferred way to do complicated searching and cutting of strings is to 
work with indices directly. A generic iteration protocol is used only for 
step-by-step forward iteration; there are no random-access iterators which 
bundle a collection with its index. This approach is not compatible with 
UTF-8: people would not remember to use something like "nextChar s i" 
instead of "i+1", and the error would go unnoticed for ASCII strings.

Moreover, exposing the fact that strings are encoded in UTF-8 would make the 
API uglier in every respect. There would be constant errors which don't show 
up until non-ASCII characters are encountered, e.g. when isUpper is called 
with the first byte instead of the first code point. Implementing Unicode 
algorithms like collation and normalization in terms of UTF-8 is also 
harder, since they are specified in terms of code points.
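
A made-up example of the first-byte bug, assuming a Unicode wchar_t as on 
glibc:

    #include <wctype.h>

    /* UTF-8 for "łoś": 0xC5 0x82 0x6F 0xC5 0x9B */
    static const unsigned char s[] = { 0xC5, 0x82, 0x6F, 0xC5, 0x9B, 0 };

    int first_is_upper_wrong(void)
    {
        /* Classifies the raw byte 0xC5, which (with a Unicode wchar_t)
           is U+00C5 'Å': reports "uppercase" for a lowercase word. */
        return iswupper(s[0]);
    }

    int first_is_upper_right(void)
    {
        /* The first code point, decoded from 0xC5 0x82, is U+0142 'ł'. */
        return iswupper(0x0142);
    }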

> The API typically has a data type 'external-format', consisting of
> EOL and encoding. Then you have some functions for creating streams
> (to files, pipes, sockets) which all take an 'external-format' argument.
>
> Furthermore you need some functions for converting a string from/to
> a byte sequence using an 'external-format'. (These can be methods on
> the 'external-format' object.)

Right, this is what I'm planning. The devil is in the details :-)
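
Roughly something like this, say (a sketch only; every name here is a 
placeholder, nothing is committed):

    #include <stddef.h>
    #include <stdint.h>

    typedef enum { EOL_LF, EOL_CRLF, EOL_CR } Eol;

    /* An external format couples an encoding with an EOL convention. */
    typedef struct {
        const char *encoding;    /* e.g. "UTF-8", "ISO-8859-2", "binary" */
        Eol eol;
    } ExternalFormat;

    typedef struct Stream Stream;

    /* Every stream constructor takes the external format up front. */
    Stream *open_file(const char *path, const char *mode, ExternalFormat fmt);
    Stream *open_socket(const char *host, int port, ExternalFormat fmt);

    /* Plain conversions between byte sequences and UTF-32 strings;
       both return the number of units written. */
    size_t decode(ExternalFormat fmt, const uint8_t *bytes, size_t n,
                  uint32_t *out, size_t outsize);
    size_t encode(ExternalFormat fmt, const uint32_t *chars, size_t n,
                  uint8_t *out, size_t outsize);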

> > - What other issues I will encounter?
>
> People will want to switch the 'external-format' of a stream on the fly,
> because in some protocols like HTTP some part of the data is binary and
> other parts are text in a given encoding.

Yep.
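
Continuing the placeholder sketch above, with a hypothetical set_format for 
switching mid-stream:

    /* Hypothetical: change the external format of an open stream. */
    void set_format(Stream *s, ExternalFormat fmt);

    /* An HTTP client reads the headers as Latin-1 text with CRLF line
       endings, then flips the same stream to raw bytes for the body. */
    void read_response(Stream *s)
    {
        ExternalFormat headers = { "ISO-8859-1", EOL_CRLF };
        ExternalFormat body    = { "binary",     EOL_LF   };

        set_format(s, headers);
        /* ... read header lines as text until the blank line ... */

        set_format(s, body);
        /* ... read the remaining body as raw bytes ... */
    }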

> The following article might be interesting for you.
> http://www.elwoodcorp.com/eclipse/papers/lugm98/lisp-c.html

Interesting, but unfortunately not very applicable to me. Because of tail 
calls, accurate copying garbage collection, showing a stack trace on error, 
and resizing the stack on overflow, the integration with C can't be that 
tight.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
