On Fri, 4 Jul 2003, Marcin 'Qrczak' Kowalczyk wrote:
> > > ...With UTF-8 it's much worse...
> > Not actually by much, unless you're lucky enough to be able to decide that
> > a following accent never alters which class a character falls in.  (What
> > if the user wants to write a "has no accents" predicate?)
> 
> Then of course it's not easier, but in the cases I encountered it is easier.

Depends on whether you want to accommodate just those cases, or a more
general problem.

> A language spec which allows non-ASCII identifiers should say that modifiers 
> are allowed as part of identifiers, because it's the simplest to specify and 
> it works.

Yes, this is reasonable, but it doesn't solve every such problem.
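For concreteness, a rough sketch in C of the rule being discussed (the
function names are made up, and only the basic Combining Diacritical Marks
block U+0300..U+036F is recognized -- a real lexer would consult the full
Unicode tables, including non-ASCII base letters):

    #include <stdint.h>

    /* Illustrative only: does this block contain combining marks?
       Simplified to the basic Combining Diacritical Marks block. */
    int is_combining_mark(uint32_t cp)
    {
        return cp >= 0x0300 && cp <= 0x036F;
    }

    int is_ident_continue(uint32_t cp)
    {
        if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9') || cp == '_')
            return 1;
        /* The rule in question: a combining mark never ends an
           identifier, so the scanner need not ask which base
           character it modifies or what class the result falls in. */
        return is_combining_mark(cp);
    }

The point is only that the continuation rule stays a per-code-point check;
it still doesn't answer questions like the "has no accents" predicate above.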

> > Sure you can.  You look for any of a set of whitespace sequences.  UTF-8
> > encoding is unambiguous...
> 
> Looking for "one of a given set of sequences" is not an instance of the 
> problem "looking for a character satisfying a given predicate", so the API 
> would have to be different.

True.  On the other hand, it's not hard to have a predicate that examines
a sequence rather than a character, so the API doesn't have to be very
different if you really want predicates. 
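A sketch of what such an API might look like (names invented here, not taken
from any existing library): let the predicate see the remaining byte sequence
and report how many bytes at the front it considers a match, so the scanner
can step over multi-byte and multi-character units alike.

    #include <stddef.h>

    /* Hypothetical predicate type: given the remaining UTF-8 bytes,
       return how many bytes at the front match (0 means no match). */
    typedef size_t (*seq_pred)(const char *s, size_t len);

    /* Return the offset of the first match, or len if none. */
    size_t find_match(const char *s, size_t len, seq_pred pred)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            if (pred(s + i, len - i) > 0)
                return i;
            /* Advancing a byte at a time is safe: a predicate that
               matches only well-formed UTF-8 sequences can never
               match starting at a continuation byte. */
        }
        return len;
    }

    /* Example predicate: ASCII space or U+00A0 (no-break space, C2 A0). */
    size_t is_space_seq(const char *s, size_t len)
    {
        if (len >= 1 && s[0] == ' ')
            return 1;
        if (len >= 2 && (unsigned char)s[0] == 0xC2
                     && (unsigned char)s[1] == 0xA0)
            return 2;
        return 0;
    }

Usage would be find_match(buf, n, is_space_seq); the caller never has to
decode code points itself.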

> Also specifying the stopping criterion in terms of 
> sequences rather than predicates is more expensive when there are many 
> matching characters.

No, it's more efficient.  Knowing exactly which things can match, instead
of having to ask a predicate each time, almost always permits a better
implementation.  Boyer-Moore techniques, in particular, are available only
if the desired sequences are exactly known in advance.
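For example (a sketch, not anyone's production code): when the needle is a
fixed UTF-8 byte string, a Boyer-Moore-Horspool skip table can be built once
from the needle's bytes, something no per-character predicate interface can
exploit.

    #include <stddef.h>
    #include <string.h>

    /* Boyer-Moore-Horspool search for a fixed byte sequence in UTF-8
       text.  Works on raw bytes: because UTF-8 is self-synchronizing,
       a byte-level match of a well-formed needle is also a
       character-aligned match. */
    const char *bmh_search(const char *hay, size_t hlen,
                           const char *needle, size_t nlen)
    {
        size_t skip[256], i;

        if (nlen == 0 || hlen < nlen)
            return NULL;

        for (i = 0; i < 256; i++)
            skip[i] = nlen;                 /* default shift: whole needle */
        for (i = 0; i + 1 < nlen; i++)
            skip[(unsigned char)needle[i]] = nlen - 1 - i;

        for (i = 0; i + nlen <= hlen;
             i += skip[(unsigned char)hay[i + nlen - 1]]) {
            if (memcmp(hay + i, needle, nlen) == 0)
                return hay + i;
        }
        return NULL;
    }

With a predicate, by contrast, every position must be decoded and tested; the
skip table lets most positions be passed over without looking at them at all.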

                                                          Henry Spencer
                                                       [EMAIL PROTECTED]

