On Sat, May 04, 2013 at 08:49:48PM -0400, Devin Jeanpierre wrote:
> - It is possible in general to remove the overhead of NFA, by
>   compiling to DFA instead of emulating the NFA.
> 
>   Caveats:
> 
>     * The algorithm for doing this for tagged NFA (NFA that keep track
>       of submatch extraction) is not well studied. Ville Laurikari wrote a
>       paper on it, but the paper has some errors, and  there is no public
>       implementation of this algorithm. (I meant to implement it before,
>       but haven't gotten around to it yet.)

Possibly also of interest: ocamllex.  I believe it works by generating
some kind of DFA, and it supports submatches.  I don't know if there's a
good description of its algorithm.

I've also wondered about trying to apply Brzozowski derivatives to
regular expressions with captures (or markers more generally, maybe); I
didn't get very far, but there might be something interesting there.

The other thing about ocamllex is that some of its expressiveness
comes from lexers being able to call each other recursively, including
tail-recursively, passing arbitrary arguments.  This is more powerful
than a "run this code then go to state X" feature in that the state can
contain arbitrary values, and seems hard to do as elegantly in Rust in
the absence of tail calls.

> - There are approaches that attempt to get "most of the time"
>   performance like DFA, but without the huge memory cost. These are
>   pretty cutting edge though, and therefore also not well studied.
> 
>     * http://link.springer.com/chapter/10.1007%2F978-3-642-15512-3_4
>         (without submatch extraction)
>     * http://www.hpl.hp.com/techreports/2012/HPL-2012-215.pdf
>         (with)

So... they use BDDs to represent the state set and the transition
relation, and step the machine with existentials.

>   I don't know how it /really/ compares performance wise with NFA and
>   backtracking implementations (the paper authors cannot be trusted to give
>   reliable performance data ;). What I can do is implement this and some
>   benchmarks, and then decide based on that how significant a technique
>   it is.

My entirely uninformed guess is that BDDs will work best (compared to
other approaches) when the NFA state set would be large -- e.g., for
an IDS, where the application is matching a lot of (probably somewhat
overlapping) patterns at once.  For a more focused search, or for
something like a programming language lexer (potentially lots of
patterns, but many of them small and trivially disjoint), the results
might be different.

>   I am slightly skeptical of this technique compared to Thompson NFA,
>   because OBDDs don't have great worst-case performance (although
>   it still beats exponential time).

Yes, this seemed to be a conspicuous absence from the paper -- if the
NFA has reachable state sets that aren't efficiently representable as
BDDs, it may be possible to craft input strings that reach them.  It's
still linear in the input size, but if it changes the coefficient enough
to exhaust the IDS's CPU capacity (or RAM), then that's a problem.

--Jed

_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev

Reply via email to