Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.

Paolo Bonzini Wed, 10 Mar 2010 12:27:55 -0800

>> Due to the issue with brackets, however we'll have to live
>> either with dfa.c being only 99% correct, or with it punting to
>> glibc's regex matcher.
>
> I think dfa needs to be made smart enough to know when it must
> punt to regex.


It already knows.  However, for [a-x], [= =] and [. .] currently it
prefers a nonexact result to a punting.  It only punts for
backreferences.

> In fact, I think we should assume that dfa lives in a "regex is also
> around" environment (true of grep and gawk, probably for gettext too),
> and therefore the interface should be broadened to include a "don't
> trust me, call regex yourself" return code.

It's already there, again.

>> Finding a balance is hard also because on
>> non-glibc platform we then lose speed _but not gain precision_.
>
> I don't understand this.  All of gawk, grep, and gettext should include
> a recent version of regex for use on non-GLIBC systems anyway. So
> in practice, regex is always available.

Regex gets bracket expressions right on glibc systems only.  On
non-glibc systems only, regcomp/regexec might get them right, but
regex will have exactly the same fallback as current dfa.c.

> I think I have an obligation at this point to mention:
>
>        http://swtch.com/~rsc/regexp/
>
> In particular, there is code there for an "Efficient (non-backtracking)
> NFA implementation with submatch tracking. Accepts UTF-8 and
> wide-character Unicode input. Traditional egrep syntax only. Written by
> Rob Pike."
>
> Perhaps this can serve as the basis for a new unified matcher?

I think this is very much related to the algorithms already in use by
regex.  A unified matcher will always be slower than DFA.

Paolo

Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.

Reply via email to