Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:
> I'm designing and implementing a programming language for fun. I
> must decide how strings should be handled.
>
> The language is most similar to Dylan, but let's assume its purpose
> will be like Python's. It will have a compiler which produces C code
> and should work at least on Linux and Windows.
>
How optimized should it be? How much custom runtime library code are
you ready to write and use (vs. compiling to calls to standard
libraries)? Are you ready to require sane i18n libraries (GNU libc,
libiconv, etc.) or do you want to work out-of-the-box with any unix
vendor's half-baked i18n facilities? The same question again, more
amplified, for Windows: how well do you want to work with its native
facilities?
> What is already decided: There is no separate character type,
> strings of length 1 are used to represent characters. Strings are
> immutable. The programmer will sometimes use indices to address
> parts of strings, string handling will not be heavily regexp-based
> as in Perl. File I/O on Unix will use the Unix API, not stdio, but
> stdio could be the optional last resort for portability.
>
> Here are the choices I imagine:
>
> 1. Strings are in UTF-32 on Unix and UTF-16 on Windows. They are
> recoded on the fly during I/O and communication with an OS (e.g.
> in filenames), with some recoding framework to be designed.
> Takes the default encoding from the locale. For binary I/O and
> for implementing the conversions there are byte arrays separate
> from strings. The source code can be annotated with its encoding
> for non-ASCII characters in string literals.
>
Why can't the source always be in UTF-8? I'm not aware of any Windows
editors that can handle UTF-{16,32}{,BE,LE} but not UTF-8. Or do you
refer to allowing unibyte encodings (latin-*, cp-*)?
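As for the on-the-fly recoding framework in 1, iconv(3) already does
most of the work. A minimal sketch of the input direction, assuming
GNU libc or libiconv and a little-endian UTF-32 internal form (the
function name and the lack of error recovery are mine, purely for
illustration):

    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdint.h>

    /* Decode a chunk of locale-encoded bytes into UTF-32 code points.
       Returns the number of code points written, or (size_t)-1 on error. */
    size_t decode_locale_to_utf32(const char *in, size_t inlen,
                                  uint32_t *out, size_t outlen)
    {
        setlocale(LC_CTYPE, "");     /* normally done once at startup */
        iconv_t cd = iconv_open("UTF-32LE", nl_langinfo(CODESET));
        if (cd == (iconv_t)-1)
            return (size_t)-1;

        char *inp = (char *)in, *outp = (char *)out;
        size_t inleft = inlen, outleft = outlen * sizeof *out;
        size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv_close(cd);
        if (r == (size_t)-1)
            return (size_t)-1;
        return (outlen * sizeof *out - outleft) / sizeof *out;
    }

The same shape works in the other direction for output, and with
"UTF-16LE" as the target if you do end up taking that road on Windows.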
> 2. Strings are in UTF-8, otherwise it's the same as the above. The
> programmer can create malformed strings; they use byte offsets for
> indexing.
>
2.1. Strings are internally in UTF-8 but the programmer is presented
with an illusion of character-level indexing. This can be
sub-optimal with some access patterns so you might want to
provide some kind of iteration abstraction to reduce the use of
indexing.
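To make that concrete, the iterator can hide all the byte arithmetic
behind something like this (a rough sketch that assumes well-formed
UTF-8; a real runtime would validate):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode the code point starting at byte *pos and advance *pos past it. */
    uint32_t utf8_next(const unsigned char *s, size_t *pos)
    {
        unsigned char b = s[*pos];
        uint32_t cp;
        size_t extra;

        if      (b < 0x80) { cp = b;        extra = 0; }  /* ASCII           */
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  /* 2-byte sequence */
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  /* 3-byte sequence */
        else               { cp = b & 0x07; extra = 3; }  /* 4-byte sequence */

        while (extra--)
            cp = (cp << 6) | (s[++*pos] & 0x3F);     /* continuation bytes */
        *pos += 1;
        return cp;
    }

User code then loops with "while (pos < nbytes) c = utf8_next(s, &pos);"
and never sees the offsets; for random access you can keep a small
cache of (character index, byte offset) pairs to avoid rescanning from
the start.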
> 3. Strings are sequences of bytes in an unspecified ASCII-based encoding.
> The programmer is responsible for converting them as appropriate.
>
The easy road for you, but the hard one for the programmer. However,
you can't escape the fact that the programmer will interface with
formats/protocols whose encoding only he can determine, so you should
leave him the option of manually encoding/decoding text between 3 and
1 or 2.
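In other words, the string type needs an explicit escape hatch. Its
shape might be no more than this pair of primitives (the names and
types are invented here, not part of your design):

    #include <stddef.h>

    typedef struct ByteArray ByteArray;   /* raw bytes, no assumed encoding */
    typedef struct String    String;      /* text in the internal form      */

    /* Decode bytes in the named charset (e.g. "ISO-8859-2", "UTF-8")
       into a string; returns NULL on malformed input or unknown charset. */
    String    *string_decode(const ByteArray *bytes, const char *charset);

    /* Encode a string into bytes in the named charset; returns NULL if
       some character is not representable there. */
    ByteArray *string_encode(const String *text, const char *charset);

Everything else (locale defaults, iconv, Qt, Gtk conventions) can be
layered on top of these two.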
> Most languages take 3; as I understand it, Perl takes a mix of 3 and 2, and
> Python has both 3 and 1. I think I will take 1,
As said above, I don't see any way to take *only* 1 (or 2), without 3.
> but I need advice:
> - Are there other reasonable ways?
> - Is it good to use UTF-32 on Unix and UTF-16 on Windows?
Absolutely, if you want a headache ;-). If you write a recoding
subsystem anyway, why would you ever want to touch UTF-16? IMHO,
choose either UTF-32 or UTF-8. And allow yourself to depend on the
encoding for internal uses.
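The concrete headache is surrogate pairs: with UTF-16 even reading a
single character needs a branch that UTF-32 never does. A sketch,
assuming valid input:

    #include <stddef.h>
    #include <stdint.h>

    /* Read the code point at *pos (counted in 16-bit units) and advance *pos. */
    uint32_t utf16_next(const uint16_t *s, size_t *pos)
    {
        uint16_t u = s[(*pos)++];
        if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
            uint16_t lo = s[(*pos)++];               /* low surrogate  */
            return 0x10000 + (((uint32_t)(u - 0xD800) << 10) | (lo - 0xDC00));
        }
        return u;                                    /* BMP character  */
    }

So UTF-16 is variable-width just like UTF-8, only with a rarer and
easier-to-forget special case.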
> - What should the conversion API look like? Are there other such APIs
> which I can look at? It should permit interfacing with iconv and other
> platform-specific converters, and with C/C++ libraries which use various
> conventions (locale-based encoding in most, UTF-16 in Qt, UTF-8 in Gtk).
> - What other issues will I encounter?
>
--
Beni Cherniavsky <[EMAIL PROTECTED]>
"Reading the documentation I felt like a kid in a toy shop."
-- Phil Thompson on Python's standard library