Strings in a programming language

Marcin 'Qrczak' Kowalczyk Thu, 03 Jul 2003 08:25:48 -0700

I'm designing and implementing a programming language for fun. I must decide 
how strings should be handled.


The language is most similar to Dylan, but let's assume its purpose will be 
like Python's. It will have a compiler which produces C code and should work 
at least on Linux and Windows.

What is already decided: there is no separate character type; strings of 
length 1 are used to represent characters. Strings are immutable. The 
programmer will sometimes use indices to address parts of strings, but string 
handling will not be as heavily regexp-based as in Perl. File I/O on Unix will 
use the Unix API, not stdio, but stdio could be an optional last resort for 
portability.

Here are the choices I imagine:

1. Strings are in UTF-32 on Unix and UTF-16 on Windows. They are recoded on
   the fly during I/O and communication with the OS (e.g. in filenames), with
   some recoding framework to be designed. The default encoding is taken from
   the locale. For binary I/O and for implementing the conversions there are
   byte arrays separate from strings. The source code can be annotated with
   its encoding for non-ASCII characters in string literals.

2. Strings are in UTF-8; otherwise it's the same as the above. The programmer
   can create malformed strings, and indexing uses byte offsets.

3. Strings are sequences of bytes in an unspecified ASCII-based encoding.
   The programmer is responsible for converting them as appropriate.
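
To make the difference between the choices concrete, here is a minimal
sketch in Python (used only for illustration; it is not the language being
designed). The helper names unit_counts and is_well_formed_utf8 are
hypothetical. It counts the same text in the unit each representation would
index by, and shows how a byte-offset slice under choice 2 can produce a
malformed string.

```python
def unit_counts(text):
    """Return the length of text in the indexing unit of each representation."""
    return {
        "utf32": len(text.encode("utf-32-le")) // 4,  # 4-byte code units
        "utf16": len(text.encode("utf-16-le")) // 2,  # 2-byte code units
        "utf8": len(text.encode("utf-8")),            # bytes
    }


def is_well_formed_utf8(data):
    """Report whether a byte sequence decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 'a' (ASCII), U+0142 (two bytes in UTF-8), and U+1D11E (outside the BMP).
sample = "a\u0142\U0001D11E"
counts = unit_counts(sample)

encoded = sample.encode("utf-8")
cut_inside = encoded[:2]       # ends in the middle of U+0142
cut_on_boundary = encoded[:3]  # ends exactly after U+0142
```

The three counts differ for this sample, so an index valid in one
representation does not carry over to another. The same kind of split can
happen with UTF-16 code unit indices when a character outside the BMP is
stored as a surrogate pair.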

Most languages take 3; as I understand it, Perl takes a mix of 3 and 2, and 
Python has both 3 and 1. I think I will take 1, but I need advice:
- Are there other reasonable ways?
- Is it good to use UTF-32 on Unix and UTF-16 on Windows?
- What should the conversion API look like? Are there other such APIs
  which I can look at? It should permit interfacing with iconv and other
  platform-specific converters, and with C/C++ libraries which use various
  conventions (locale-based encoding in most, UTF-16 in Qt, UTF-8 in Gtk).
- What other issues will I encounter?
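
As a rough sketch of the shape such a conversion API could take under
choice 1, here is one possibility written in Python on top of its standard
codecs and locale modules. The function names default_encoding,
decode_external, encode_external and recode are hypothetical, and the idea
that the locale supplies the default is taken from choice 1 above. Byte
arrays stay separate from strings, and conversion happens only at the
boundary.

```python
import codecs
import locale


def default_encoding():
    """Return the canonical codec name for the locale's preferred encoding."""
    return codecs.lookup(locale.getpreferredencoding(False)).name


def decode_external(data, encoding=None, errors="strict"):
    """Convert bytes received from the outside world into a string."""
    return data.decode(encoding or default_encoding(), errors)


def encode_external(text, encoding=None, errors="strict"):
    """Convert a string into bytes for a file, a filename or a C library."""
    return text.encode(encoding or default_encoding(), errors)


def recode(data, source, target):
    """Convert bytes between two external encodings via the string type."""
    return encode_external(decode_external(data, source), target)


# Example: text arriving as UTF-8 (the Gtk convention) handed to a library
# that expects UTF-16 (the Qt convention), here little-endian without a BOM.
utf16_bytes = recode(b"\xc5\x82", "utf-8", "utf-16-le")
```

A real implementation would also need an incremental interface for streams,
where a multi-byte sequence can be split across two reads, and a policy for
input that is not valid in the declared encoding.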

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
