On Fri, 4 July 2003 19:44, Henry Spencer wrote:

> > with UTF-32 you can write a predicate for "allowed as the first character
> > in identifier" and "allowed in the rest of identifier", and take as many
> > characters as satisfy the predicate. With UTF-8 it's much worse...
>
> Not actually by much, unless you're lucky enough to be able to decide that
> a following accent never alters which class a character falls in.  (What
> if the user wants to write a "has no accents" predicate?)

In that case it is of course no easier, but in the cases I have encountered
it is. A language spec which allows non-ASCII identifiers should say that
modifiers (such as following accents) are allowed as part of identifiers,
because that is the simplest rule to specify and it works.
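
A minimal sketch of that approach, in Python, whose strings are sequences of
code points and so stand in for UTF-32 here. The two predicates and the name
scan_identifier are hypothetical, and the character classes are only an
assumption about what a language spec might choose:

```python
import unicodedata

def is_ident_start(ch):
    # Assumption: a letter or underscore may start an identifier.
    return ch == "_" or unicodedata.category(ch).startswith("L")

def is_ident_rest(ch):
    # Assumption: letters, decimal digits, underscore and combining
    # marks (general categories Mn, Mc, Me) may continue an identifier.
    cat = unicodedata.category(ch)
    return ch == "_" or cat[0] in ("L", "M") or cat == "Nd"

def scan_identifier(text, pos=0):
    """Return the index just past the identifier starting at pos,
    or pos itself if no identifier starts there."""
    if pos >= len(text) or not is_ident_start(text[pos]):
        return pos
    end = pos + 1
    while end < len(text) and is_ident_rest(text[end]):
        end += 1
    return end
```

Because combining marks satisfy the "rest" predicate, an identifier written
as "cafe" followed by U+0301 is taken whole, without the scanner having to
look ahead to see whether an accent follows.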

> Sure you can.  You look for any of a set of whitespace sequences.  UTF-8
> encoding is unambiguous; the sequence for a character never occurs as part
> of the sequence for a different character.  If you find the sequence for a
> whitespace character, it is always that character.

Looking for "one of the given sequences" is not an instance of the problem
"looking for a character satisfying the given predicate", so the API would
have to be different. Also, specifying the stopping criterion in terms of
sequences rather than predicates is more expensive when there are many
matching characters.
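
To make the cost concrete, here is a rough sketch of the sequence-based
search over UTF-8 bytes. The whitespace list is an illustrative subset, not
a complete set, and find_any_sequence is a hypothetical helper:

```python
# Illustrative subset of whitespace characters (an assumption, not a full list).
WHITESPACE = [" ", "\t", "\n", "\u00a0", "\u2003", "\u3000"]
WS_SEQUENCES = [c.encode("utf-8") for c in WHITESPACE]

def find_any_sequence(data, sequences, start=0):
    """Return (index, sequence) for the earliest match of any of the
    given byte sequences in data, or (-1, None) if none occurs."""
    best_index, best_seq = -1, None
    for seq in sequences:
        i = data.find(seq, start)
        if i != -1 and (best_index == -1 or i < best_index):
            best_index, best_seq = i, seq
    return best_index, best_seq
```

In valid UTF-8 a hit is always the intended character, as the quoted text
says, but this naive version makes one pass per sequence, and the caller has
to enumerate every matching character up front instead of passing a
predicate.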

So it would take away a simple way of specifying the search criterion (a
character predicate) and grant nothing comparable in return: searching for
one of the given subsequences works in both encodings, while searching by a
predicate works well only in UTF-32 (unless you decode UTF-8 on the fly).
UTF-8 is never simpler than UTF-32, except when you interface with the
outside world in UTF-8.
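
For comparison, decoding UTF-8 on the fly so that a character predicate can
still be used might look roughly like this. It is a sketch that assumes the
input is valid UTF-8 and that start falls on a character boundary; iter_utf8
and find_by_predicate are hypothetical names:

```python
def iter_utf8(data, start=0):
    """Yield (byte_offset, character) pairs from data.
    Assumes valid UTF-8 and that start is on a character boundary."""
    i = start
    while i < len(data):
        lead = data[i]
        # The lead byte determines the length of the encoded character.
        if lead < 0x80:
            n = 1
        elif lead < 0xE0:
            n = 2
        elif lead < 0xF0:
            n = 3
        else:
            n = 4
        yield i, data[i:i + n].decode("utf-8")
        i += n

def find_by_predicate(data, pred, start=0):
    """Return the byte offset of the first character for which pred
    is true, or -1 if there is none."""
    for offset, ch in iter_utf8(data, start):
        if pred(ch):
            return offset
    return -1
```

For example, find_by_predicate(data, str.isspace) accepts any predicate on
characters, at the price of a decoding step for every character scanned.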

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
