Given how you frame the problem, the hash lookup isn't even an
option! No question, 6000+ string searches will be slow compared to a
trie. Given the varying requirements we all encounter day to day, I
think this is an interesting exercise. Thanks for sharing these modules, Ted.
The OP indicated that the text can be tokenized:
KS> Unfortunately, my names can be embedded in larger "words" of the input
KS> text, as long as they are delimited by certain punctuation.
Greg L. pointed out some tools to assist with that. I was just pointing
out that it is worth his while to make that work. If the input is line
oriented, a "real" parser is unlikely to increase the cost by much, but
it might be more complicated than a simple split() expression.
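For example, a split that treats punctuation as well as whitespace as a
delimiter, while keeping the delimiters so the line can be rebuilt, might
look something like the sketch below. The character class is a guess at the
OP's punctuation, not his actual spec, and Foo/Bar stand in for the real list:

#!/usr/bin/perl
use strict;
use warnings;

my %keywords = ( Foo => 1, Bar => 2 );

while ( my $line = <> )
{
    chomp $line;
    # Capture the delimiters so join('') can reassemble the original line.
    my @pieces = split /([\s.,;:\/-]+)/, $line;
    for my $piece (@pieces)
    {
        $piece = "prefix_$piece" if exists $keywords{$piece};
    }
    print join('', @pieces), "\n";
}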
To get a comparison, I tried Regexp::Trie on my Moby Dick dictionary and
text and it definitely gets the job done in a reasonable time: 22:30 (vs
09:17 for split+hash). Good to know.
Despite its better search performance, Text::Match::FastAlternatives
won't be usable here: it only tells you whether the text matches, so it
won't help you mangle the output. For the regex to actually do the
rewriting, you need to use it in a substitution, like so: s/($rtr)/prefix_$1/g.
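A minimal sketch of that substitution approach (the two-word list stands in
for the real 6000-name dictionary, and the streaming loop is just for
illustration):

use strict;
use warnings;
use Regexp::Trie;

my $rt = Regexp::Trie->new;
$rt->add($_) for qw(Foo Bar);   # the real list has ~6000 names
my $rtr = $rt->regexp;          # one compiled alternation regex

while ( my $line = <> )
{
    # Rewrite every occurrence in place; this is the step a boolean
    # matcher like Text::Match::FastAlternatives can't do for you.
    $line =~ s/($rtr)/prefix_$1/g;
    print $line;
}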
At 10:02 PM 2/5/2011 -0600, Ted Zlatanov wrote:
On Sat, 05 Feb 2011 18:27:13 -0500 Charlie <[email protected]> wrote:
C> The sample program below runs in 00:09:04 on 1.15GB (1024 copies of
C> Moby Dick). Replacing the hard-coded map with 2 entries with 6000
C> words taken from the text (randomly selected, unique, >5 chars) runs
C> in 00:09:17. I.e. the map lookup is trivial and has little impact on
C> the overall performance. A trie might do just a bit better than a
C> hash table, but you are optimizing the wrong part of the program.
C> my %keywords = ( Foo => 1, Bar => 2 );
C> while ( <> )
C> {
C> my $line = '';
C> my @tokens = split /\s/;
C> for my $tkn ( @tokens )
C> {
C> if ( '' ne $line ) {
C> $line .= ' ';
C> }
C> if ( exists $keywords{$tkn} ) {
C> $line .= 'prefix_';
C> }
C> $line .= $tkn;
C> }
C> print $line, "\n";
C> }
You assumed that \s will delimit the tokens. That's not the case (see
the original message; the interesting data can occur anywhere), so you
can't tokenize and do a simple hash lookup. If you benchmark 6000
index() calls vs. 2 index() calls, you'll see why a trie makes sense.
You may benefit if you tokenize, which I mentioned in my message:
TZ> If you know a boundary (e.g. newline or space or \b or \W) which
TZ> always demarcates your strings, you can split your input on that and
TZ> feed the pieces to Tree::Trie.
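A rough sketch of that tokenize-and-lookup approach, assuming Tree::Trie's
deepsearch('exact') lookup mode and \W as the boundary (both assumptions for
illustration, not the OP's actual spec):

use strict;
use warnings;
use Tree::Trie;

my $trie = Tree::Trie->new;
$trie->add(qw(Foo Bar));        # the real 6000-word list goes here
$trie->deepsearch('exact');     # lookup() returns the word only on an exact match

while ( my $line = <> )
{
    chomp $line;
    # Split on non-word characters, the assumed boundary.
    for my $piece ( split /\W+/, $line )
    {
        next unless length $piece;
        print "prefix_$piece\n" if defined $trie->lookup($piece);
    }
}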
There is often a significant benefit to optimizing this kind of work in
C, as Text::Match::FastAlternatives does, because you may be able to
keep the code in the tight inner loop from busting the instruction
cache. The performance benefits are huge.
I attach a benchmark of trie vs. fastmatch. I commented out the index()
benchmark because it was ridiculously slow, on the order of 23 per
second on a fairly recent CPU. Text::Match::FastAlternatives seems the
best choice, as advertised.
Ted
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use Benchmark qw/cmpthese/;
use Text::Match::FastAlternatives;
use Regexp::Trie;
my $text = 'abcd ' x 100;   # 500-character haystack
my @words = 'a' .. 'hzz';   # 6110 keywords via Perl's magic string increment
my $fast = Text::Match::FastAlternatives->new(@words);
my $rt = Regexp::Trie->new;
$rt->add($_) foreach @words;
my $rtr = $rt->regexp();    # single alternation regex built from the trie
my $word_count = scalar @words;
cmpthese(5000000,
         {
          # "$word_count keywords, index()" => sub
          # {
          #   my $found;
          #   foreach (@words)
          #   {
          #     $found += index $text, $_, 0;
          #   }
          # },
          "$word_count keywords, fastmatch" => sub
          {
            return $fast->match($text)
          },
          "$word_count keywords, trie" => sub
          {
            return ($text =~ m/$rtr/);
          },
         });
__DATA__
                               Rate 6110 keywords, trie 6110 keywords, fastmatch
6110 keywords, trie        638570/s                  --                     -73%
6110 keywords, fastmatch  2325581/s                264%                       --
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm