>> Due to the issue with brackets, however we'll have to live >> either with dfa.c being only 99% correct, or with it punting to >> glibc's regex matcher. > > I think dfa needs to be made smart enough to know when it must > punt to regex.
It already knows. However, for [a-x], [= =] and [. .] currently it prefers a nonexact result to a punting. It only punts for backreferences. > In fact, I think we should assume that dfa lives in a "regex is also > around" environment (true of grep and gawk, probably for gettext too), > and therefore the interface should be broadened to include a "don't > trust me, call regex yourself" return code. It's already there, again. >> Finding a balance is hard also because on >> non-glibc platform we then lose speed _but not gain precision_. > > I don't understand this. All of gawk, grep, and gettext should include > a recent version of regex for use on non-GLIBC systems anyway. So > in practice, regex is always available. Regex gets bracket expressions right on glibc systems only. On non-glibc systems only, regcomp/regexec might get them right, but regex will have exactly the same fallback as current dfa.c. > I think I have an obligation at this point to mention: > > http://swtch.com/~rsc/regexp/ > > In particular, there is code there for an "Efficient (non-backtracking) > NFA implementation with submatch tracking. Accepts UTF-8 and > wide-character Unicode input. Traditional egrep syntax only. Written by > Rob Pike." > > Perhaps this can serve as the basis for a new unified matcher? I think this is very much related to the algorithms already in use by regex. A unified matcher will always be slower than DFA. Paolo
