Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0

Alex Shinn Sun, 16 Sep 2012 22:01:35 -0700

Thanks Peter!

On Mon, Sep 17, 2012 at 4:46 AM, Peter Bex <[email protected]> wrote:
> Hi everyone,
>
> Because the version of master was bumped to 4.8.1, I decided to go ahead
> and update our core irregex to the latest version.  The attached patches
> (all 4 of them) synchronizes us with upstream 0.9.0 irregex.  This gives
> some performance improvements for submatches.
>
> Before 0.9, irregex would rely on backtracking whenever submatches
> occurred in the SRE.  The new version uses a clever algorithm by
> Ville Laurikari based on "tagged" NFAs (or tNFAs), which are compiled
> to DFAs which can perform submatch bookkeeping while processing the
> input, in one pass.  You can get more info by reading the thesis that
> introduced this: http://laurikari.net/ville/regex-submatch.pdf
> There's also a very short paper that describes the concepts:
> http://laurikari.net/ville/spire2000-tnfa.pdf
> But beware, the short paper uses slightly different notation,
> terminology, and algorithms(!)
>
> Irregex (and thus, Chicken) is one of the few implementations of
> Laurikari's algorithm.  The others that I've found were:
>
> - Ville's own "TRE" C library (http://laurikari.net/tre/).  He wrote this
>    as an industrial-strength regex lib after completing the prototype
>    for his thesis in some obscure C++ macro hackery.  This library is
>    included in NetBSD 6.0 and if I understand correctly there are plans
>    to replace Henry Spencer's regexp engine in libc with TRE.
> - A regex class for the Tango library for the "D" language
>    
> (http://www.dsource.org/projects/tango/docs/current/htmlsrc/tango.text.Regex.html,
>    http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.html).
>    I used this implementation to gain a detailed working understanding of the
>    algorithm where the paper was too hand-wavy.
> - A tNFA Emacs library (http://www.emacswiki.org/emacs/TaggedNFA)
>    The heavy imperative coding style and (to me) strange naming
>    conventions made this useless to understand the algorithm.
>    Man, people will write anything in Emacs Lisp!
> - Haskell has a pluggable Regex engine, which also has a tNFA backend.
>    (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa)
>    However, they've invented an own concept ("Orbits") to fit their type
>    system.  This boggled my mind, so I kind of left it at that.
>    Design notes about Orbits: http://www.haskell.org/haskellwiki/RegexpDesign
>
> One possibly interesting aspect of this algorithm is that, according
> to Laurikari's thesis, with some modifications it should be able to
> perform "approximate" or "fuzzy" matching.  NetBSD 6 now includes an
> "agrep" command which leverages TRE's ability to do this.  I was only
> interested in the performance improvements so, didn't look into this.
>
> I've done some quick benchmarks.  There's the old "Chicken vs Perl"
> chestnut: 
> http://lists.nongnu.org/archive/html/chicken-users/2011-09/msg00131.html
>
> Without patch:
> interpreted: csi -s href-extraction.scm  3.75s user 0.11s system 97% cpu 
> 3.962 total
> compiled (-O3): ./href-extraction  2.53s user 0.05s system 98% cpu 2.605 total
>
> With patch:
> interpreted: csi -s href-extraction.scm  2.01s user 0.08s system 98% cpu 
> 2.126 total
> compiled (-O3): ./href-extraction  1.34s user 0.08s system 98% cpu 1.434 total
>
> Unfortunately, Perl still takes our lunch:
> perl extract.pl  0.05s user 0.01s system 83% cpu 0.072 total
>
> I'm unsure what the bottleneck is here.  Warrants more investigation.


Note this is an apples to oranges comparison - Perl
is using an optimistic algorithm that does well in common
cases like this.  It will outperform even highly tuned C
implementations of NFA matching like TRE or Google's RE
here.  On the other hand, there are patterns for which
Perl takes exponential time.

> As another test, I ran the irregex alternative implementation of uri-generic.
> This is simply the output of running the following at the REPL after
> loading the compiled uri-generic.irregex.so:
> #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference 
> "http://call-with-current-continuation.org/foo/bar?mooh#qux";) (loop (add1 
> i))))
>
> Without patch:
> 1.683s CPU time, 0.083s GC time (major), 600142 mutations, 26/4716 GCs 
> (major/minor)
> With patch:
> 1.313s CPU time, 0.073s GC time (major), 600143 mutations, 23/4720 GCs 
> (major/minor)
> This is very close to the "main" implementation, which is based on the
> matchable egg.  It converts input strings into lists of chars and matches
> on those:
> 1.268s CPU time, 0.038s GC time (major), 1320215 mutations, 15/5555 GCs 
> (major/minor)
>
> So it looks like this consistently improves performance for some
> real-world tasks.
>
> Attached you also find the output of the benchmark program in the
> upstream irregex distribution, with an unpatched and a patched
> Chicken (changed the code to (use irregex) instead of loading
> it from the source dir).  You can see that for some benchmarks
> the improvement is quite a lot larger than in the "real-world" ones.
>
> There's a trade-off though: Regex compilation time is up for most of
> these, for some *considerably* much.  I think this is an acceptable
> tradeoff, as long as the compilation time doesn't go through the roof.
> In earlier tests, before some tweaking, this was the case.
>
> Anyone using irregex in practice is welcome to try this patch and report
> back results, especially regarding compilation times.  Of course match
> times are useful to know as well.
>
> Perhaps we can bring down compilation times by making clever use of some
> more specializations.  Better representations of tNFA multi-states are
> also welcome.  Tweaking this representation and the algorithms that
> manipulate them is what influenced compilation times the most during
> the implementation of this stuff.
>
> Cheers,
> Peter
> --
> http://sjamaan.ath.cx
> --
> "The process of preparing programs for a digital computer
>  is especially attractive, not only because it can be economically
>  and scientifically rewarding, but also because it can be an aesthetic
>  experience much like composing poetry or music."
>                                                         -- Donald Knuth
>
> _______________________________________________
> Chicken-hackers mailing list
> [email protected]
> https://lists.nongnu.org/mailman/listinfo/chicken-hackers
>

_______________________________________________
Chicken-hackers mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-hackers

Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0

Reply via email to