I'm designing and implementing a programming language for fun. I must decide
how strings should be handled.
The language is most similar to Dylan, but let's assume its purpose will be
like Python's. It will have a compiler which produces C code and should work
at least on Linux and Windows.
What is already decided: there is no separate character type; strings of
length 1 are used to represent characters. Strings are immutable. The
programmer will sometimes use indices to address parts of strings; string
handling will not be heavily regexp-based as in Perl. File I/O on Unix will
use the Unix API, not stdio, but stdio could be the optional last resort for
portability.
Here are the choices I imagine:
1. Strings are in UTF-32 on Unix and UTF-16 on Windows. They are recoded on
the fly during I/O and communication with the OS (e.g. in filenames), with
some recoding framework to be designed. The default encoding is taken from
the locale. For binary I/O and for implementing the conversions there are byte
arrays separate from strings. The source code can be annotated with its
encoding for non-ASCII characters in string literals.
2. Strings are in UTF-8; otherwise it's the same as the above. The programmer
can create malformed strings, and strings use byte offsets for indexing.
3. Strings are sequences of bytes in an unspecified ASCII-based encoding.
The programmer is responsible for converting them as appropriate.
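
To make the indexing difference between the three choices concrete, here is a
small sketch in Python (used only for illustration; it is not the language
being designed). The sample text and all variable names are my own
assumptions. It counts the code units the same text would occupy under each
choice, and shows how a byte-offset slice or a wrongly assumed encoding goes
wrong.

```python
# Illustration only: Python's str stands in for "a sequence of code points".
# The sample text is arbitrary: Polish letters plus U+1D11E, outside the BMP.
s = "za\u017c\u00f3\u0142\u0107 \U0001D11E"

code_points = len(s)
utf32_units = len(s.encode("utf-32-le")) // 4   # choice 1 on Unix
utf16_units = len(s.encode("utf-16-le")) // 2   # choice 1 on Windows
utf8_bytes = len(s.encode("utf-8"))             # choice 2

# Choice 2: a slice at a byte offset can cut a character in half.
prefix = s.encode("utf-8")[:3]   # b"za" plus the first byte of U+017C
try:
    prefix.decode("utf-8")
    prefix_is_well_formed = True
except UnicodeDecodeError:
    prefix_is_well_formed = False

# Choice 3: the same bytes mean different things depending on which
# encoding the programmer assumes.
latin2_bytes = "za\u017c".encode("iso-8859-2")   # 3 bytes
misread = latin2_bytes.decode("iso-8859-1")      # no error, but wrong letter
```

For this text the counts are 8 UTF-32 units, 9 UTF-16 units (the last
character needs a surrogate pair) and 15 UTF-8 bytes, so an index means
something different under each representation, including between the Unix
and Windows variants of choice 1.
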
Most languages take 3; as I understand it, Perl takes a mix of 3 and 2, and
Python has both 3 and 1. I think I will take 1, but I need advice:
- Are there other reasonable ways?
- Is it good to use UTF-32 on Unix and UTF-16 on Windows?
- What should the conversion API look like? Are there other such APIs
which I can look at? It should permit interfacing with iconv and other
platform-specific converters, and with C/C++ libraries which use various
conventions (locale-based encoding in most, UTF-16 in Qt, UTF-8 in Gtk).
- What other issues will I encounter?
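
As a starting point for discussion, here is a rough sketch of the kind of
conversion API I have in mind, written in Python on top of its standard
codecs and locale modules. The class name Recoder and its methods are
hypothetical, not an existing library. It keeps byte arrays and strings
separate, takes the default encoding from the locale, and offers a stateful
decoder for I/O, because a read can end in the middle of a multibyte
sequence.

```python
import codecs
import locale


class Recoder:
    """Hypothetical converter between byte arrays and strings."""

    def __init__(self, encoding=None, errors="strict"):
        # With no explicit encoding, fall back to the locale's default.
        self.encoding = encoding or locale.getpreferredencoding(False)
        self.errors = errors
        self._info = codecs.lookup(self.encoding)

    def decode(self, data: bytes) -> str:
        """Convert a complete byte array to a string."""
        text, _consumed = self._info.decode(data, self.errors)
        return text

    def encode(self, text: str) -> bytes:
        """Convert a string to a byte array."""
        data, _consumed = self._info.encode(text, self.errors)
        return data

    def stream_decoder(self):
        """Return a stateful decoder that buffers incomplete sequences."""
        return self._info.incrementaldecoder(self.errors)


# A UTF-8 character split across two reads is reassembled by the decoder.
dec = Recoder("utf-8").stream_decoder()
first = dec.decode(b"za\xc5")               # trailing byte is held back
second = dec.decode(b"\xbc", final=True)    # completes U+017C
```

A real framework would also need a way to plug in iconv or other
platform-specific converters behind the same interface, and a policy for
filenames that are not valid in the locale's encoding; this sketch leaves
both open.
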
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/