> I will consider this. Here are disadvantages:
> 
>   With UTF-8 it's much worse, you must in
>   essence decode UTF-8 on the fly. You can't even implement "split on
>   whitespace" without UTF-8 decoding, because you don't know what part of the
>   string to test whether it's a whitespace character.

Well, for C++, whitespace is ' ', '\r', '\n', '\t'; it's totally trivial.
If you want to allow [NEL]s or deal with mandatory CRLF sequences, it's not
really that hard either. Iterating over a UTF-8 string is both trivial and
easy, and you are easily put into a position where you can deal with
sequences of codepoints as well as sequences of bytes. And you don't have
to decode to do equality tests.
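
For instance, here is a minimal sketch (mine, not from the earlier mail) of
split-on-whitespace over raw UTF-8 bytes. No decoding is needed because
every byte of a multi-byte UTF-8 sequence is >= 0x80, so it can never
compare equal to an ASCII whitespace byte:

#include <iostream>
#include <string>
#include <vector>

// True for the four ASCII whitespace bytes mentioned above.
static bool is_ascii_ws(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Split a UTF-8 encoded string on ASCII whitespace without decoding it.
std::vector<std::string> split_ws_utf8(const std::string& s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        while (i < s.size() && is_ascii_ws(s[i])) ++i;    // skip separators
        std::size_t start = i;
        while (i < s.size() && !is_ascii_ws(s[i])) ++i;   // consume one token
        if (i > start) out.push_back(s.substr(start, i - start));
    }
    return out;
}

int main() {
    // "naïve café text" written with explicit UTF-8 byte escapes.
    for (const auto& tok : split_ws_utf8("na\xC3\xAFve caf\xC3\xA9 text"))
        std::cout << tok << '\n';   // prints: naïve  café  text
}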


> - Simple one-time-use programs which assume that characters are what you get
>   when you index strings (which break paragraphs or draw ASCII tables or count
>   occurrences of characters) are broken more often.

That type of logic is broken, period. Personally, my opinion is that allowing
it does more harm than good. If you want to iterate over a string, don't use
single-codepoint indexing. (For example, Spanish in NFD wouldn't work even in
UTF-32, because some letters would take two codepoints.)
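
To make that concrete, a quick sketch (mine, not from the original mail):
the precomposed "ñ" is one code point, but its NFD form is 'n' (U+006E)
followed by COMBINING TILDE (U+0303), so even char32_t indexing does not
give you one slot per letter:

#include <iostream>
#include <string>

int main() {
    std::u32string nfc = U"\u00F1";     // "ñ" precomposed (NFC): 1 code point
    std::u32string nfd = U"n\u0303";    // "ñ" decomposed (NFD): 2 code points
    std::cout << nfc.size() << ' ' << nfd.size() << '\n';   // prints "1 2"
}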


> No, this would be awful, because users won't remember to do the conversion or
> won't do the right one. Half of the program would use UTF-8, half would use
> the encoding of the file, and it will break when they exchange non-ASCII
> data.

Conversely, only the user can do the right conversion.

> It makes no sense because it would require an Unicode-capable editor for 
> programs which use only ASCII. The source will have to be recodable at least 
> from UTF-8 and ISO-8859-x.

I agree: that being a strike against UTF-32, n'est-ce pas? If you can't even
use it for the language itself, then why force it internally? (If you really
want UTF-32, then open that can of dogfood :)

I think many people have yet to realize that the structure of Unicode is
inherently biased towards multi-byte encodings. Unicode is multi-codepoint
by design, and the notion that "code unit == codepoint == character" is
fundamentally broken, an illusion that may trap the unwary.

Going forward with a UTF-32 language as you describe could actually impair its
i18n abilities. A Play-Doh knife can be very convenient for newbie Unicode
programmers, but they'll resent it when they need to carve a steak.


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
