Re: Strings in a programming language

Marcin 'Qrczak' Kowalczyk Thu, 03 Jul 2003 10:22:34 -0700

On Thu, 3 July 2003 19:02, srintuar26 wrote:

> Well, for C++, whitespace is ' ', '\r', '\n', '\t'; it's totally trivial.


Replace "whitespace" with "an arbitrary character predicate", e.g. for finding 
the end of an identifier.
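
As a minimal sketch of that use case (Python is used here only because
its strings are indexed by code point; scan_while and is_ident_char are
hypothetical names, and the identifier rule is an assumption made for
illustration):

```python
from typing import Callable


def scan_while(text: str, start: int, pred: Callable[[str], bool]) -> int:
    """Return the index of the first code point at or after `start`
    for which `pred` is false (or len(text) if there is none)."""
    i = start
    while i < len(text) and pred(text[i]):
        i += 1
    return i


def is_ident_char(ch: str) -> bool:
    # Assumed identifier rule: letters, digits and underscore.
    return ch.isalnum() or ch == "_"


# "\u017c\u00f3\u0142w_1" is a non-ASCII identifier (precomposed letters).
source = "\u017c\u00f3\u0142w_1 = 42"
end = scan_while(source, 0, is_ident_char)
identifier = source[:end]  # the six code points before the space
```

The predicate sees one code point at a time, so the same scanner works
for whitespace, identifier characters, or any other character class.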

> If you want to iterate over a string, don't use single codepoint indexing.

What to use instead? What should be the interface to a function which splits
a string into a list of strings by characters satisfying a predicate?
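
One possible shape for such an interface, sketched in Python under the
assumption that the predicate receives a single code point (split_by
and is_separator are hypothetical names, not an existing API):

```python
from typing import Callable, List


def split_by(text: str, is_separator: Callable[[str], bool]) -> List[str]:
    """Split `text` at every code point for which `is_separator` is true.

    Separators are dropped; adjacent separators yield empty pieces.
    """
    pieces: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_separator(ch):
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces


fields = split_by("a,b;c", lambda ch: ch in ",;")
```

A predicate over code points cannot see a base letter together with its
combining marks, which is exactly the limitation the NFD objection below
is about.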

> (For example, Spanish in NFD wouldn't work even in UTF-32, because some
> letters would take two codepoints)

If someone chooses NFD, he will get what he deserves. It should not penalize 
all programmers.
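
To make the NFD point concrete, here is a small sketch using Python's
standard unicodedata module; the sample word is only an illustration:

```python
import unicodedata

word = "ni\u00f1o"  # "nino" with a precomposed n-tilde (U+00F1)

nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# In NFC the n-tilde is one code point; in NFD it becomes the base
# letter "n" followed by U+0303 COMBINING TILDE.
nfc_length = len(nfc)
nfd_length = len(nfd)
nfd_code_points = [hex(ord(ch)) for ch in nfd]
```

In NFC every letter of this word is a single code point, so indexing by
code point matches the letters; in NFD it does not.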

> > No, this would be awful, because users won't remember to do the
> > conversion or won't do the right one. Half of the program would use
> > UTF-8, half would use the encoding of the file, and it will break when
> > they exchange non-ASCII data.
>
> Conversely, only the user can do the right conversion.

Only he can *choose* the right conversion, but it should be performed 
automatically.

> > It makes no sense because it would require an Unicode-capable editor for
> > programs which use only ASCII. The source will have to be recodable at
> > least from UTF-8 and ISO-8859-x.
>
> I agree: that being a strike against UTF-32, n'est-ce pas? If you can't even
> use it for the language itself, then why force it internally?

Because otherwise a string API that uses indexing makes little sense (although
we could live with that).

The encoding of files doesn't have to have anything to do with strings in
memory, and it will always require some conversion - because I want to allow
ISO-8859-2 source and at the same time I don't want to use multiple encodings
of identifiers at once in the compiler.
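
A sketch of that separation, assuming the file encoding is known by the
time the source is read (read_source is a hypothetical helper; the
codec names are the ones Python's standard library provides):

```python
def read_source(raw: bytes, encoding: str) -> str:
    """Decode source bytes once, at the I/O boundary.

    Everything after this point sees only code points, whatever the
    file's encoding was.
    """
    return raw.decode(encoding)


identifier = "\u017c\u00f3\u0142w"  # a Polish word used as an identifier

# The same identifier as it would appear in two differently encoded files.
as_latin2 = identifier.encode("iso8859_2")
as_utf8 = identifier.encode("utf-8")

from_latin2 = read_source(as_latin2, "iso8859_2")
from_utf8 = read_source(as_utf8, "utf-8")
same_identifier = from_latin2 == from_utf8
```

The byte sequences differ, but after decoding the compiler compares one
internal representation only.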

> I think many people have yet to realize that the structure of Unicode is
> inherently biased towards multi-byte encodings. Unicode is multi-codepoint
> by design, and the conception that "encoding-unit == codepoint ==
> character" is fundamentally broken, and an illusion that may trap the
> unwary.

It's easier to map codepoints to characters than to map bytes to characters
(which must go through codepoints anyway). What kind of string processing
does UTF-8 make simpler than UTF-32?
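
The two-step mapping can be sketched as follows. Grouping a base code
point with the combining marks that follow it is a deliberate
simplification assumed for this example, not the full Unicode grapheme
cluster rules; to_code_points and group_characters are hypothetical
names:

```python
import unicodedata
from typing import List


def to_code_points(data: bytes) -> List[int]:
    """Step 1: UTF-8 bytes to code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def group_characters(code_points: List[int]) -> List[str]:
    """Step 2: attach each combining mark to the preceding code point.

    Simplified: uses only the canonical combining class.
    """
    groups: List[str] = []
    for cp in code_points:
        ch = chr(cp)
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


data = "nin\u0303o".encode("utf-8")  # "nino" with a decomposed n-tilde
code_points = to_code_points(data)
characters = group_characters(code_points)
```

Starting from bytes, step 1 cannot be skipped; starting from code
points, only step 2 remains.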

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
