Hi Marcin,
> Most languages take 3, as I understand Perl it takes the mix of 3 and 2,
> and Python has both 3 and 1. I think I will take 1, but I need advice: -
Don't look at Perl in this case - Perl has the handicap that for historical
reasons it cannot make a clear distinction between byte arrays (= binary
data) and character arrays (= strings = text).
Python's way of doing it - byte arrays are automatically converted to
character arrays when needed - is OK when you consider what
Python < 1.5 looked like. But for a freshly designed language it would be
an unnecessary complexity.
In Lisp (Common Lisp - Scheme guys appear not to care about Unicode or
i18n) the common approach is to have one or two flavours of strings,
namely strings containing Unicode characters, and possibly a second
flavour, strings containing only ISO-8859-1 characters. Conversion
is done during I/O. The Lisp 'open' function has had an argument
'external-format' since 1984 or 1986 at least; nowadays a combination
of the encoding and the newline convention (Mac CR, Dos CRLF or Unix LF)
gets passed here.
You can find details here:
- GNU clisp
http://clisp.sourceforge.net/impnotes/encoding.html
http://clisp.sourceforge.net/impnotes/stream-dict.html#open
- Allegro Common Lisp
http://www.franz.com/support/documentation/6.2/doc/iacl.htm
- Liquid Common Lisp
http://www.lispworks.com/reference/lcl50/ics/ics-1.html
- LispWorks
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-95.htm#pgfId-886156
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-101.htm#98500
http://www.lispworks.com/reference/lwl42/LWRM-U/html/lwref-u-76.htm#pgfId-902973
and some older details at
http://www.cliki.net/Unicode%20Support
> 1. Strings are in UTF-32 on Unix
Since strings are immutable in your language, you can also represent
strings as UCS-2 or ISO-8859-1 where possible; this saves up to 75% of the
memory in many cases, at the cost of slightly more expensive element access.
> and UTF-16 on Windows. They are recoded on
> the fly during I/O and communication with an OS (e.g. in filenames),
> with some recoding framework to be designed.
Why not use UTF-32 as the internal representation on Windows as well?
I mean, once you have decided to put in place a conversion layer for
I/O, this conversion layer can convert to UTF-16 on Windows. What you
gain: you have the same internal representation on all platforms.
> 2. Strings are in UTF-8, otherwise it's the same as the above. The
> programmer can create malformed strings; they use byte offsets for indexing.
Unless you provide some built-in language constructs for safely iterating
across a string, like
for (c across-string: str) statement
this would be too cumbersome for the user who is not aware of i18n.
> - What should the conversion API look like? Are there other such APIs
> which I can look at? It should permit interfacing with iconv and other
> platform-specific converters, and with C/C++ libraries which use various
> conventions (locale-based encoding in most, UTF-16 in Qt, UTF-8 in Gtk).
The API typically has a data type 'external-format', consisting of
EOL and encoding. Then you have some functions for creating streams
(to files, pipes, sockets) which all take an 'external-format' argument.
Furthermore you need some functions for converting a string from/to
a byte sequence using an 'external-format'. (These can be methods on
the 'external-format' object.)
> - What other issues will I encounter?
People will want to switch the 'external-format' of a stream on the fly,
because in some protocols like HTTP some part of the data is binary and
other parts are text in a given encoding.
> The language is most similar to Dylan, but let's assume its purpose will be
> like Python's. It will have a compiler which produces C code
The following article might be of interest to you.
http://www.elwoodcorp.com/eclipse/papers/lugm98/lisp-c.html
Bruno
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/