On Thu, 3 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote:
> - Character predicates must work on strings and it's not obvious what part of
>   the string to feed to them. For example in a compiler/interpreter with
>   UTF-32 you can write a predicate for "allowed as the first character in
>   identifier" and "allowed in the rest of identifier", and take as many
>   characters as satisfy the predicate. With UTF-8 it's much worse...

Not actually by much, unless you're lucky enough to be able to decide that
a following accent never alters which class a character falls in.  (What
if the user wants to write a "has no accents" predicate?)  If you want to
handle the general case, then in *either* UTF-32 or UTF-8, the predicates
*must* apply to sequences, not just to individual characters.  UTF-8 just
makes the sequences, on average, rather longer.
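
To make the "predicates apply to sequences" point concrete, here is a
minimal sketch in Python.  The names clusters() and has_no_accents() are
made up for illustration, and grouping a base character with the combining
marks that follow it is a simplification of real text segmentation, not a
full treatment.

```python
import unicodedata

def clusters(text):
    """Yield each base character together with the combining marks after it."""
    current = ""
    for ch in text:
        # A non-combining character starts a new cluster.
        if current and unicodedata.combining(ch) == 0:
            yield current
            current = ch
        else:
            current += ch
    if current:
        yield current

def has_no_accents(cluster):
    """True if the cluster contains no combining marks once decomposed."""
    decomposed = unicodedata.normalize("NFD", cluster)
    return all(unicodedata.combining(ch) == 0 for ch in decomposed)
```

The predicate is asked about a whole cluster, so "e" followed by U+0301 is
judged as one unit whether the storage underneath is UTF-32 or UTF-8.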

>   ...You can't even implement "split on
>   whitespace" without UTF-8 decoding, because you don't know what part of the
>   string to test whether it's a whitespace character.

Sure you can.  You look for any of a set of whitespace sequences.  UTF-8
encoding is unambiguous; the sequence for a character never occurs as part
of the sequence for a different character, nor across the boundary between
two adjacent characters.  If you find the sequence for a whitespace
character, it is always that character.

Searching for sequences rather than individual characters is harder,
yes... but this is exactly the sort of messy implementation chore that a
language can usefully hide from the user. 
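
As a rough sketch of what that hidden chore might look like, here is a
whitespace split in Python that works directly on UTF-8 bytes and never
decodes the input.  The WHITESPACE list is a deliberately small sample, not
the full Unicode set, and split_on_whitespace() is a name invented for this
example.

```python
import re

# A small, illustrative sample of whitespace characters (an assumption,
# not the complete Unicode whitespace set).
WHITESPACE = [" ", "\t", "\n", "\r", "\u2003", "\u3000"]

# Build one byte-level pattern from the UTF-8 sequence of each character.
_pattern = re.compile(
    b"(?:" + b"|".join(re.escape(ch.encode("utf-8")) for ch in WHITESPACE) + b")+"
)

def split_on_whitespace(data):
    """Split UTF-8 encoded bytes on whitespace sequences, without decoding."""
    return [piece for piece in _pattern.split(data) if piece]
```

Because a whitespace sequence cannot turn up inside the encoding of some
other character, the byte-level match is safe on valid UTF-8 input.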

> - Simple one-time-use programs which assume that characters are what you get
>   when you index strings (which break paragraphs or draw ASCII tables or count
>   occurrences of characters) are broken more often.

Definitely true.  This strongly suggests providing more powerful tools, so
the user doesn't have to use indexing so often.  (Rob Pike once noted that
one of the most damning things about awk is how often you see substr()
used in a language which does include regexp support.)  Users can and will
use higher-level tools, if they are provided -- indexing is a pain! 
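
For illustration only, here is the kind of job that tends to get done with
index arithmetic and substr(), done with a pattern instead.  This is a
Python sketch; parse_setting() and the "name = value" line format are
assumptions made up for the example.

```python
import re

def parse_setting(line):
    """Return (name, value) from a 'name = value' line, or None if no match."""
    match = re.fullmatch(r"\s*(\w+)\s*=\s*(.*?)\s*", line)
    return match.groups() if match else None
```

No offsets appear anywhere, so nothing depends on how many bytes or code
units each character happens to occupy.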

                                                          Henry Spencer
                                                       [EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
