On Fri, 4 July 2003 19:44, Henry Spencer wrote:
> > with UTF-32 you can write a predicate for "allowed as the first character
> > in identifier" and "allowed in the rest of identifier", and take as many
> > characters as satisfy the predicate. With UTF-8 it's much worse...
>
> Not actually by much, unless you're lucky enough to be able to decide that
> a following accent never alters which class a character falls in. (What
> if the user wants to write a "has no accents" predicate?)
Then of course it's not easier, but in the cases I have encountered it is.
A language spec which allows non-ASCII identifiers should say that modifiers
are allowed as part of identifiers, because that is the simplest rule to
specify and it works.
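
A minimal sketch of the predicate approach, in Python, whose str is indexed
by code point and so behaves like UTF-32 for this purpose. The function names
and the exact category choices (letters or underscore to start; letters,
decimal digits, underscore and combining marks to continue) are assumptions
made for illustration, not taken from any particular language spec:

```python
import unicodedata


def is_ident_start(ch: str) -> bool:
    # Assumption: a letter (category L*) or underscore may start an identifier.
    return ch == "_" or unicodedata.category(ch).startswith("L")


def is_ident_rest(ch: str) -> bool:
    # Assumption: letters, decimal digits, underscore and combining
    # marks (category M*) may continue an identifier.
    cat = unicodedata.category(ch)
    return ch == "_" or cat[0] in ("L", "M") or cat == "Nd"


def scan_identifier(text: str, pos: int = 0) -> int:
    """Return the index just past the identifier starting at pos.

    Returns pos itself when no identifier starts there.
    """
    if pos >= len(text) or not is_ident_start(text[pos]):
        return pos
    end = pos + 1
    # Take as many code points as satisfy the "rest" predicate.
    while end < len(text) and is_ident_rest(text[end]):
        end += 1
    return end
```

Because combining marks are accepted by the "rest" predicate, an "e" followed
by U+0301 stays inside the identifier without any special handling.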
> Sure you can. You look for any of a set of whitespace sequences. UTF-8
> encoding is unambiguous; the sequence for a character never occurs as part
> of the sequence for a different character. If you find the sequence for a
> whitespace character, it is always that character.
Looking for "one of the given sequences" is not an instance of the problem
"looking for a character satisfying the given predicate", so the API would
have to be different. Also, specifying the stopping criterion in terms of
sequences rather than predicates is more expensive when there are many
matching characters.
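
To illustrate the cost, here is a rough sketch in Python of the
sequence-based search over UTF-8 bytes. The WHITESPACE list is only an
illustrative subset, and find_any_sequence is a hypothetical helper, not an
existing API. The table needs one entry for every character the predicate
would accept:

```python
# Illustrative subset of whitespace characters; a real table would be longer.
WHITESPACE = [" ", "\t", "\n", "\u00a0", "\u2003", "\u3000"]

# One UTF-8 byte sequence per matching character.
SEQUENCES = [ch.encode("utf-8") for ch in WHITESPACE]


def find_any_sequence(data: bytes, sequences) -> int:
    """Return the byte offset of the earliest match of any sequence, or -1."""
    best = -1
    for seq in sequences:
        # Naive approach: one scan of the data per sequence.
        i = data.find(seq)
        if i != -1 and (best == -1 or i < best):
            best = i
    return best
```

This naive version scans the data once per sequence; a smarter multi-pattern
matcher would avoid that, but it would still need the full table of
sequences instead of a single predicate.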
So it would take away a simple way of specifying the search criterion - a
character predicate - and it would grant nothing comparable: searching for
one of the given subsequences works in both cases, while searching by a
predicate works well only in UTF-32 (unless you decode UTF-8 on the fly).
UTF-8 is never simpler than UTF-32 except when you interface with the outside
world in UTF-8.
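
For comparison, a sketch in Python of the "decode on the fly" option: a
hypothetical helper that walks a UTF-8 byte string, decodes one code point at
a time and applies a character predicate. It assumes the input is well-formed
UTF-8 and does no validation:

```python
def find_by_predicate_utf8(data: bytes, pred) -> int:
    """Return the byte offset of the first character satisfying pred, or -1.

    Assumes data is well-formed UTF-8; malformed input is not detected.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # The lead byte gives the sequence length and the top payload bits.
        if b < 0x80:
            length, cp = 1, b
        elif b < 0xE0:
            length, cp = 2, b & 0x1F
        elif b < 0xF0:
            length, cp = 3, b & 0x0F
        else:
            length, cp = 4, b & 0x07
        # Each continuation byte contributes its low six bits.
        for k in range(1, length):
            cp = (cp << 6) | (data[i + k] & 0x3F)
        if pred(chr(cp)):
            return i
        i += length
    return -1
```

With UTF-32 the same search is just a loop over array elements calling the
predicate; here the decoding step has to be carried along with every search.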
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/