Re: Strings in a programming language

Beni Cherniavsky Thu, 03 Jul 2003 10:05:10 -0700

Marcin 'Qrczak' Kowalczyk wrote on 2003-07-03:

> On Thu, 3 July 2003 18:17, Beni Cherniavsky wrote:
>
> > How many custom runtime libraries are you ready to write and use
> > (vs. compiling to calls to standard libraries)?
>
> I'm prepared to interface to iconv and such, and to different I/O systems on
> different platforms. I don't want to implement conversions from scratch.
> Character predicates and string handling will be implemented from scratch.
>
> I'm more afraid of causing headaches for other people trying to interface
> to C libraries.
>
Please elaborate: interfacing from what to which C libraries?  Do you
mean people writing C extensions for your language?


> > Are you ready to require sane i18n libraries (GNU libc, libiconv,
> > etc.) or do you want to work out-of-the-box with any unix vendor's
> > half-baked i18n facilities?
>
> It will probably have rich i18n support if libraries are available. With just
> ANSI C available it will fall back to minimal support for a few encodings.
>
> I'm not sure what interesting things these libraries provide, besides
> recoding and character predicates.
>
> > The last question repeated but more
> > amplified for windows - how well do you want to work with
>
> You haven't finished the sentence.
>
Indeed.  I wanted to ask how well it should work with the different
versions of Windows, especially win9x with its nearly nonexistent
Unicode support.

> > 2.1. Strings are internally in UTF-8 but the programmer is presented
> >      with an illusion of character-level indexing.  This can be
> >      sub-optimal with some access patterns so you might want to
> >      provide some kind of iteration abstraction to reduce the use of
> >      indexing.
>
> I think it would be inefficient or horribly complicated. Ideas like caching
> the last character-to-byte mapping used in a string are not an option - it's
> way too ugly. I'm free to define what to index a string should mean, but it
> should be consistent with the physical representation.
>
OK.  Then if you want to expose an interface for indexing strings by
Unicode code points at all (but see srintuar26's comment), you are
almost bound to use UTF-32.  UTF-16 would have precisely the same
indexing problems as UTF-8, because surrogate pairs make it a
variable-width encoding too.
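
To make the indexing argument concrete, here is a rough Python sketch.
The function names are made up for illustration and are not part of any
proposal in this thread.  It only shows that finding the N-th code point
in UTF-8 needs a scan, while in UTF-32 it is plain offset arithmetic, and
that UTF-16 code units stop matching code points once a character
outside the BMP appears.

```python
def iter_utf8_codepoints(data: bytes):
    """Yield each code point of well-formed UTF-8 `data` as a 1-char str."""
    start = 0
    for pos in range(1, len(data) + 1):
        # A byte that is not of the form 10xxxxxx starts a new code point.
        if pos == len(data) or data[pos] & 0xC0 != 0x80:
            yield data[start:pos].decode("utf-8")
            start = pos


def utf8_codepoint_at(data: bytes, index: int) -> str:
    """Code point number `index` in UTF-8 bytes: needs an O(n) scan."""
    if index < 0:
        raise IndexError(index)
    for i, ch in enumerate(iter_utf8_codepoints(data)):
        if i == index:
            return ch
    raise IndexError(index)


def utf32_codepoint_at(data: bytes, index: int) -> str:
    """Code point number `index` in UTF-32-LE bytes: O(1) offset arithmetic."""
    if index < 0 or 4 * index + 4 > len(data):
        raise IndexError(index)
    return data[4 * index:4 * index + 4].decode("utf-32-le")


# Sample text with 1-, 2-, 3- and 4-byte UTF-8 sequences (4 code points).
text = "a\u00e9\u20ac\U0001d11e"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# UTF-16 needs a surrogate pair for the last character, so the number of
# 16-bit code units is no longer the number of code points.
utf16_units = len(utf16) // 2
```

Sequential access through the iterator stays linear overall, which is the
kind of iteration abstraction suggested in 2.1; only random access by
index pays for the scan each time.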

> > If you write a recoding subsystem anyway, why would you ever want
> > to touch UTF-16?
>
> Because I think it's widely used in the Windows API. Unfortunately I don't
> program on Windows at all.
>
Neither do I, so don't take my comments too seriously ;).  Indeed,
AFAIK the only way to make Unicode calls in Windows is to go through
UTF-16 (and most of these calls will break on win9x, curse M$).  However,
UTF-16 is badly inconvenient internally.  The trade-off is between
internal efficiency and OS call efficiency - make your choice.
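
One way to read that trade-off, sketched in Python under the assumption
that strings are kept in UTF-8 internally and recoded only at the OS
boundary.  The helper names are hypothetical, and no real Windows call is
made here; the sketch only builds and parses the NUL-terminated
UTF-16-LE buffers that such calls would exchange.

```python
def to_os_wide(internal_utf8: bytes) -> bytes:
    """Recode an internal UTF-8 string to a NUL-terminated UTF-16-LE buffer."""
    text = internal_utf8.decode("utf-8")
    return text.encode("utf-16-le") + b"\x00\x00"


def from_os_wide(wide: bytes) -> bytes:
    """Recode a UTF-16-LE buffer (up to its first NUL code unit) to UTF-8."""
    end = len(wide) - len(wide) % 2
    # Look for the terminator at code-unit (even) offsets only, so that a
    # zero byte inside a normal code unit is not mistaken for it.
    for pos in range(0, end, 2):
        if wide[pos:pos + 2] == b"\x00\x00":
            end = pos
            break
    return wide[:end].decode("utf-16-le").encode("utf-8")


# Hypothetical file name containing a non-BMP character.
name = "na\u00efve_\U0001d11e.txt".encode("utf-8")
wide = to_os_wide(name)
round_trip = from_os_wide(wide)
```

The cost is one recoding pass per OS call, in exchange for not carrying
UTF-16 through the rest of the runtime.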

-- 
Beni Cherniavsky <[EMAIL PROTECTED]>

"Reading the documentation I felt like a kid in a toy shop."
 -- Phil Thompson on Python's standard library
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
