On Sat, May 04, 2013 at 08:49:48PM -0400, Devin Jeanpierre wrote: > - It is possible in general to remove the overhead of NFA, by > compiling to DFA instead of emulating the NFA. > > Caveats: > > * The algorithm for doing this for tagged NFA (NFA that keep track > of submatch extraction) is not well studied. Ville Laurikari wrote a > paper on it, but the paper has some errors, and there is no public > implementation of this algorithm. (I meant to implement it before, > but haven't gotten around to it yet.)
Possibly also of interest: ocamllex. I believe it works by generating some kind of DFA, and it supports submatches. I don't know if there's a good description of its algorithm. I've also wondered about trying to apply Brzozowski derivatives to regular expressions with captures (or markers more generally, maybe); I didn't get very far, but there might be something interesting there. The other thing about ocamllex is that some of its expressiveness comes from lexers being able to call each other recursively, including tail-recursively, passing arbitrary arguments. This is more powerful than a "run this code then go to state X" feature in that the state can contain arbitrary values, and seems hard to do as elegantly in Rust in the absence of tail calls. > - There are approaches that attempt to get "most of the time" > performance like DFA, but without the huge memory cost. These are > pretty cutting edge though, and therefore also not well studied. > > * http://link.springer.com/chapter/10.1007%2F978-3-642-15512-3_4 > (without submatch extraction) > * http://www.hpl.hp.com/techreports/2012/HPL-2012-215.pdf > (with) So... they use BDDs to represent the state set and the transition relation, and step the machine with existentials. > I don't know how it /really/ compares performance wise with NFA and > backtracking implementations (the paper authors cannot be trusted to give > reliable performance data ;). What I can do is implement this and some > benchmarks, and then decide based on that how significant a technique > it is. My entirely uninformed guess is that BDDs will work best (compared to other approaches) when the NFA state set would be large -- e.g., for an IDS, where the application is matching a lot of (probably somewhat overlapping) patterns at once. For a more focused search, or for something like a programming language lexer (potentially lots of patterns, but many of them small and trivially disjoint), the results might be different. > I am slightly skeptical of this technique compared to Thompson NFA, > because OBDDs don't have great worst-case performance (although > it still beats exponential time). Yes, this seemed to be a conspicuous absence from the paper -- if the NFA has reachable state sets that aren't efficiently representable as BDDs, it may be possible to craft input strings that reach them. It's still linear in the input size, but if it changes the coefficient enough to exhaust the IDS's CPU capacity (or RAM), then that's a problem. --Jed _______________________________________________ Rust-dev mailing list [email protected] https://mail.mozilla.org/listinfo/rust-dev
