On Fri, 4 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote:
> > > ...With UTF-8 it's much worse...
> > Not actually by much, unless you're lucky enough to be able to decide that
> > a following accent never alters which class a character falls in. (What
> > if the user wants to write a "has no accents" predicate?)
>
> Then of course it's not easier, but in the cases I encountered it is easier.
That depends on whether you want to handle just those cases or the more
general problem.
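The "has no accents" predicate is a case in point: it can't decide by
looking at one code point of the raw text (let alone one UTF-8 byte) in
isolation, because the accent may be a separate combining mark or folded
into a precomposed letter, and both spellings should be treated alike.
Purely as a sketch, in Python for brevity (the function name and the
choice of NFD normalization are mine, not anything from the thread):

    import unicodedata

    def has_no_accents(s):
        # Decompose first, so a precomposed letter such as U+00E9 exposes
        # the combining accent it contains, then reject any combining mark.
        return not any(unicodedata.combining(ch)
                       for ch in unicodedata.normalize("NFD", s))

    # has_no_accents("cafe")        -> True
    # has_no_accents("caf\u00e9")   -> False (precomposed e-acute)
    # has_no_accents("cafe\u0301")  -> False (combining acute accent)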
> A language spec which allows non-ASCII identifiers should say that modifiers
> are allowed as part of identifiers, because it's the simplest to specify and
> it works.
Yes, this is reasonable, but it doesn't solve every such problem.
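As a sketch of what such a rule means for a lexer, here is an
illustrative identifier-character test in Python; the function name and
the exact set of categories are assumptions for illustration, not taken
from any real language spec:

    import unicodedata

    def is_identifier_char(ch, first=False):
        # Letters (and "_") may start an identifier; after that, combining
        # marks (categories Mn/Mc/Me) and digits are also accepted, so the
        # precomposed and decomposed spellings of an accented name lex the
        # same way.
        cat = unicodedata.category(ch)
        if first:
            return cat.startswith("L") or ch == "_"
        return (cat.startswith("L") or cat in ("Mn", "Mc", "Me", "Nd")
                or ch == "_")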
> > Sure you can. You look for any of a set of whitespace sequences. UTF-8
> > encoding is unambiguous...
>
> Looking for "one of a given set of sequences" is not an instance of the
> problem "looking for a character satisfying a given predicate", so the API
> would have to be different.
True. On the other hand, it's not hard to have a predicate that examines
a sequence rather than a character, so the API doesn't have to be very
different if you really want predicates.
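For instance, a predicate that is handed the text and a byte offset, and
reports how many bytes it matched there, covers both the per-character
and the per-sequence case. A toy sketch in Python (the names are mine):

    # UTF-8 encodings of a few whitespace characters. Because UTF-8 is
    # self-synchronizing, a plain byte-sequence match can never begin in
    # the middle of some other character's encoding.
    WHITESPACE = [s.encode("utf-8")
                  for s in (" ", "\t", "\n", "\u00a0", "\u2003")]

    def match_whitespace(data, i):
        # Sequence predicate: length of the whitespace sequence starting
        # at byte offset i, or 0 if none starts there.
        for seq in WHITESPACE:
            if data.startswith(seq, i):
                return len(seq)
        return 0

    def find_if(data, pred):
        # Generic scanner driven by a sequence predicate.
        for i in range(len(data)):
            if pred(data, i):
                return i
        return -1

    # find_if("foo\u00a0bar".encode("utf-8"), match_whitespace) -> 3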
> Also, specifying the stopping criterion in terms of sequences rather than
> predicates is more expensive when there are many matching characters.
No, it's more efficient. Knowing exactly which things can match, instead
of having to ask a predicate each time, almost always permits a better
implementation. Boyer-Moore techniques, in particular, are available only
if the desired sequences are exactly known in advance.
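To illustrate why, here is a Boyer-Moore-Horspool sketch (Python, for
illustration only): the skip table can be built only because the needle
is known in advance; a per-character predicate gives you nothing to
build it from.

    def horspool_find(haystack, needle):
        # Precompute, for each byte value, how far the window may shift
        # when that byte is the last byte of the current window.
        m = len(needle)
        if m == 0:
            return 0
        shift = {b: m for b in range(256)}
        for i, b in enumerate(needle[:-1]):
            shift[b] = m - 1 - i
        i = 0
        while i + m <= len(haystack):
            if haystack[i:i + m] == needle:
                return i
            # Mismatch: skip ahead without inspecting every byte.
            i += shift[haystack[i + m - 1]]
        return -1

    # horspool_find("na\u00efve text".encode("utf-8"),
    #               "\u00ef".encode("utf-8"))  -> 2 (the two-byte sequence
    # for U+00EF starts at byte offset 2)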
Henry Spencer
[EMAIL PROTECTED]
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/