Re: [rust-dev] lib: regular expressions in std

Devin Jeanpierre Sat, 04 May 2013 17:50:32 -0700

On Sat, May 4, 2013 at 2:08 PM, Patrick Walton <[email protected]> wrote:
> On 5/4/13 6:54 AM, Devin Jeanpierre wrote:
>>
>>      A point brought up in #rust: if we use RE2 or similar, we may not
>>      be able to have a re!() syntax extension that compiles regexps at
>>      the same time as the surrounding rust code.
>
>
> I like the re!() extension idea, just for performance reasons. If you look
> at regexp-dna in the shootout, V8 is at the top, significantly ahead of C:
> this is because it JITs regexes. Regex performance is important for Web
> servers, because of routes.


Any reason it's important that this be a syntax extension? One could
easily imagine writing a JIT compiler using LLVM (at least, if Rust
supported calling C function pointers); however, I am not sure if std,
or even a regex module outside of std, is allowed to depend on LLVM.

> Note that, as far as I'm aware, RE2 has pretty bad performance in general
> (see Go's performance in regexp-dna). I'm actually personally somewhat
> skeptical of the practical advantages of the Thompson NFA algorithm compared
> to recursive backtracking: it makes pathological regexes much faster, but at
> the cost of significant overhead for non-pathological regexes (which are 99%
> of regexes used in reality).

FWIW, there are techniques to get around this:

- One can classify a family of regexps for which backtracking is very
  fast, and special-case them.

  RE2 special-cases the regexps for which backtracking never produces
  a match, and can run these faster than a backtracking engine, since it
  doesn't waste time backtracking uselessly.

- It is possible in general to remove the overhead of NFA, by
  compiling to DFA instead of emulating the NFA.

  Caveats:

    * The algorithm for doing this for tagged NFA (NFA that keep track
      of submatch extraction) is not well studied. Ville Laurikari wrote a
      paper on it, but the paper has some errors, and  there is no public
      implementation of this algorithm. (I meant to implement it before,
      but haven't gotten around to it yet.)

    * I actually am not sure if his paper describes POSIX style matching
      or PCRE style matching. However, it should be possible to make it
      use either one, since the principle is the same.

    * DFA can consume exponential space in the size of the NFA, which
      itself consumes exponential space in the size of the regexp. There are
      many cases of regexps that are reasonable enough but can't be sensibly
      represented as a DFA.

      This is the reason why the algorithm is not well studied -- most people
      (in relevant academic circles, at least) can't use it as a result of the
      state explosion.

- There are approaches that attempt to get "most of the time"
  performance like DFA, but without the huge memory cost. These are
  pretty cutting edge though, and therefore also not well studied.

    * http://link.springer.com/chapter/10.1007%2F978-3-642-15512-3_4
        (without submatch extraction)
    * http://www.hpl.hp.com/techreports/2012/HPL-2012-215.pdf
        (with)

  I don't know how it /really/ compares performance wise with NFA and
  backtracking implementations (the paper authors cannot be trusted to give
  reliable performance data ;). What I can do is implement this and some
  benchmarks, and then decide based on that how significant a technique
  it is.

  I am slightly skeptical of this technique compared to Thompson NFA,
  because OBDDs don't have great worst-case performance (although
  it still beats exponential time). However, that might be an advantage,
  considering your concern.
  People use OBDDs for the performance they actually give, rather
  than the performance they give in the worst case. They're a very popular
  data structure in certain circles these days because they are usually
  very fast.

  I expect that the kinds of regexes that OBDD based regex would
  perform well on would be unrelated (neither a subset nor superset) to
  the kinds of regex backtracking search performs well on, so this
  probably isn't a perfect solution. However, I haven't thought about it much.

> To me the hybrid approach is interesting: try recursive backtracking first,
> and if it takes too long fall back to the Thompson NFA, to avoid the DoS
> problem.

I'm personally biased against this, since for a naively written regex,
Thompson NFA is a far more reliable method. How about defaulting to
Thompson NFA, but letting people that know the performance
characteristics of their regex explicitly use backtracking search?

Going with backtracking search is not without other downsides. You
lose the ability to have POSIX-style regexp matches, which are useful
in some circumstances (especially lexing).

-- Devin
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev

Re: [rust-dev] lib: regular expressions in std

Reply via email to