Hi Marcin,
> Most languages take 3, as I understand Perl it takes the mix of 3 and 2,
> and Python has both 3 and 1. I think I will take 1, but I need advice: -
Don't look at Perl in this case - Perl has the handicap that for historical
reasons it cannot make a clear distinction between byte arrays (= binary
data) and character arrays (= strings = text).
Python's way of doing it - byte arrays are automatically converted to
character arrays when needed - is OK when you consider what
Python < 1.5 looked like. But for a freshly designed language it would be
an unnecessary complexity.
In Lisp (Common Lisp - Scheme guys appear not to care about Unicode or
i18n) the common approach is to have one or two flavours of strings,
namely strings containing Unicode characters, and possibly a second
flavour, strings containing only ISO-8859-1 characters. Conversion
is done during I/O. The Lisp 'open' function has had an argument
'external-format' since 1984 or 1986 at least; nowadays a combination
of the encoding and the newline convention (Mac CR, Dos CRLF or Unix LF)
gets passed here.
You can find details here:
- GNU clisp
http://clisp.sourceforge.net/impnotes/encoding.html
http://clisp.sourceforge.net/impnotes/stream-dict.html#open
- Allegro Common Lisp
http://www.franz.com/support/documentation/6.2/doc/iacl.htm
- Liquid Common Lisp
http://www.lispworks.com/reference/lcl50/ics/ics-1.html
- LispWorks
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-95.htm#pgfId-886156
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-101.htm#98500
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-76.htm#pgfId-902973
and some older details at
http://www.cliki.net/Unicode%20Support
> 1. Strings are in UTF-32 on Unix
Since strings are immutable in your language, you can also represent
strings as UCS-2 or ISO-8859-1 where possible; this saves up to 75% of the
memory in many cases, at the cost of slightly more expensive element access.
> and UTF-16 on Windows. They are recoded on
> the fly during I/O and communication with an OS (e.g. in filenames),
> with some recoding framework to be designed.
Why not use UTF-32 as the internal representation on Windows as well?
I mean, once you have decided to put in place a conversion layer for
I/O, this conversion layer can convert to UTF-16 on Windows. What you
gain: you have the same internal representation on all platforms.
> 2. Strings are in UTF-8, otherwise it's the same as the above. The
> programmer can create malformed strings; they use byte offsets for indexing.
Unless you provide some built-in language constructs for safely iterating
across a string, like
for (c across-string: str) statement
this would be too cumbersome for the user who is not aware of i18n.
> - What should the conversion API look like? Are there other such APIs
> which I can look at? It should permit interfacing with iconv and other
> platform-specific converters, and with C/C++ libraries which use various
> conventions (locale-based encoding in most, UTF-16 in Qt, UTF-8 in Gtk).
The API typically has a data type 'external-format', consisting of
EOL and encoding. Then you have some functions for creating streams
(to files, pipes, sockets) which all take an 'external-format' argument.
Furthermore you need some functions for converting a string from/to
a byte sequence using an 'external-format'. (These can be methods on
the 'external-format' object.)
> - What other issues will I encounter?
People will want to switch the 'external-format' of a stream on the fly,
because in some protocols like HTTP some part of the data is binary and
other parts are text in a given encoding.
> The language is most similar to Dylan, but let's assume its purpose will be
> like Python's. It will have a compiler which produces C code
The following article might be of interest to you.
http://www.elwoodcorp.com/eclipse/papers/lugm98/lisp-c.html
Bruno
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/