Thanks Peter! On Mon, Sep 17, 2012 at 4:46 AM, Peter Bex <[email protected]> wrote: > Hi everyone, > > Because the version of master was bumped to 4.8.1, I decided to go ahead > and update our core irregex to the latest version. The attached patches > (all 4 of them) synchronizes us with upstream 0.9.0 irregex. This gives > some performance improvements for submatches. > > Before 0.9, irregex would rely on backtracking whenever submatches > occurred in the SRE. The new version uses a clever algorithm by > Ville Laurikari based on "tagged" NFAs (or tNFAs), which are compiled > to DFAs which can perform submatch bookkeeping while processing the > input, in one pass. You can get more info by reading the thesis that > introduced this: http://laurikari.net/ville/regex-submatch.pdf > There's also a very short paper that describes the concepts: > http://laurikari.net/ville/spire2000-tnfa.pdf > But beware, the short paper uses slightly different notation, > terminology, and algorithms(!) > > Irregex (and thus, Chicken) is one of the few implementations of > Laurikari's algorithm. The others that I've found were: > > - Ville's own "TRE" C library (http://laurikari.net/tre/). He wrote this > as an industrial-strength regex lib after completing the prototype > for his thesis in some obscure C++ macro hackery. This library is > included in NetBSD 6.0 and if I understand correctly there are plans > to replace Henry Spencer's regexp engine in libc with TRE. > - A regex class for the Tango library for the "D" language > > (http://www.dsource.org/projects/tango/docs/current/htmlsrc/tango.text.Regex.html, > http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.html). > I used this implementation to gain a detailed working understanding of the > algorithm where the paper was too hand-wavy. > - A tNFA Emacs library (http://www.emacswiki.org/emacs/TaggedNFA) > The heavy imperative coding style and (to me) strange naming > conventions made this useless to understand the algorithm. > Man, people will write anything in Emacs Lisp! > - Haskell has a pluggable Regex engine, which also has a tNFA backend. > (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa) > However, they've invented an own concept ("Orbits") to fit their type > system. This boggled my mind, so I kind of left it at that. > Design notes about Orbits: http://www.haskell.org/haskellwiki/RegexpDesign > > One possibly interesting aspect of this algorithm is that, according > to Laurikari's thesis, with some modifications it should be able to > perform "approximate" or "fuzzy" matching. NetBSD 6 now includes an > "agrep" command which leverages TRE's ability to do this. I was only > interested in the performance improvements so, didn't look into this. > > I've done some quick benchmarks. There's the old "Chicken vs Perl" > chestnut: > http://lists.nongnu.org/archive/html/chicken-users/2011-09/msg00131.html > > Without patch: > interpreted: csi -s href-extraction.scm 3.75s user 0.11s system 97% cpu > 3.962 total > compiled (-O3): ./href-extraction 2.53s user 0.05s system 98% cpu 2.605 total > > With patch: > interpreted: csi -s href-extraction.scm 2.01s user 0.08s system 98% cpu > 2.126 total > compiled (-O3): ./href-extraction 1.34s user 0.08s system 98% cpu 1.434 total > > Unfortunately, Perl still takes our lunch: > perl extract.pl 0.05s user 0.01s system 83% cpu 0.072 total > > I'm unsure what the bottleneck is here. Warrants more investigation.
Note this is an apples to oranges comparison - Perl is using an optimistic algorithm that does well in common cases like this. It will outperform even highly tuned C implementations of NFA matching like TRE or Google's RE here. On the other hand, there are patterns for which Perl takes exponential time. > As another test, I ran the irregex alternative implementation of uri-generic. > This is simply the output of running the following at the REPL after > loading the compiled uri-generic.irregex.so: > #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference > "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 > i)))) > > Without patch: > 1.683s CPU time, 0.083s GC time (major), 600142 mutations, 26/4716 GCs > (major/minor) > With patch: > 1.313s CPU time, 0.073s GC time (major), 600143 mutations, 23/4720 GCs > (major/minor) > This is very close to the "main" implementation, which is based on the > matchable egg. It converts input strings into lists of chars and matches > on those: > 1.268s CPU time, 0.038s GC time (major), 1320215 mutations, 15/5555 GCs > (major/minor) > > So it looks like this consistently improves performance for some > real-world tasks. > > Attached you also find the output of the benchmark program in the > upstream irregex distribution, with an unpatched and a patched > Chicken (changed the code to (use irregex) instead of loading > it from the source dir). You can see that for some benchmarks > the improvement is quite a lot larger than in the "real-world" ones. > > There's a trade-off though: Regex compilation time is up for most of > these, for some *considerably* much. I think this is an acceptable > tradeoff, as long as the compilation time doesn't go through the roof. > In earlier tests, before some tweaking, this was the case. > > Anyone using irregex in practice is welcome to try this patch and report > back results, especially regarding compilation times. Of course match > times are useful to know as well. > > Perhaps we can bring down compilation times by making clever use of some > more specializations. Better representations of tNFA multi-states are > also welcome. Tweaking this representation and the algorithms that > manipulate them is what influenced compilation times the most during > the implementation of this stuff. > > Cheers, > Peter > -- > http://sjamaan.ath.cx > -- > "The process of preparing programs for a digital computer > is especially attractive, not only because it can be economically > and scientifically rewarding, but also because it can be an aesthetic > experience much like composing poetry or music." > -- Donald Knuth > > _______________________________________________ > Chicken-hackers mailing list > [email protected] > https://lists.nongnu.org/mailman/listinfo/chicken-hackers > _______________________________________________ Chicken-hackers mailing list [email protected] https://lists.nongnu.org/mailman/listinfo/chicken-hackers
