Given how you frame the problem, the hash lookup isn't even an option! No question, 6000+ string searches will be slow compared to a trie. Given the varying requirements we all encounter day to day, I think this is an interesting exercise. Thanks for sharing these modules, Ted.

The OP indicated that the text can be tokenized:
KS> Unfortunately, my names can be embedded in larger "words" of the input
KS> text, as long as they are delimited by certain punctuation.

Greg L. pointed out some tools to assist with that. I was just pointing out that it is worth his while to make that work. If the input is line oriented, a "real" parser is unlikely to increase the cost by much, but it might be more complicated than a simple split() expression.
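As a minimal sketch of what that tokenization might look like: split on the delimiting punctuation but capture the delimiters, so the line can be reassembled afterward. The character class here is made up for illustration; the OP's actual delimiter set may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical delimiter set -- stand-in for the OP's "certain punctuation".
my $delims = qr/[.,;:_\/-]/;

my $line = "foo.bar_baz/qux";

# Capturing split keeps the delimiters in the result list,
# so the original line can be reconstructed after mangling.
my @pieces = split /($delims)/, $line;

# Only the non-delimiter, non-empty pieces are candidate names.
my @tokens = grep { length && $_ !~ /^$delims$/ } @pieces;

print join("|", @tokens), "\n";   # foo|bar|baz|qux
```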

To get a comparison, I tried Regexp::Trie on my Moby Dick dictionary and text and it definitely gets the job done in a reasonable time: 22:30 (vs 09:17 for split+hash). Good to know.

Despite its better search performance, Text::Match::FastAlternatives will not be usable here, since it won't help you mangle the output. For the regex to actually do that work, you need to use it in a substitution, like so: s/($rtr)/prefix_$1/g.
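For illustration, here is a sketch of that substitution with a hand-built alternation standing in for the pattern Regexp::Trie would generate (the keyword list is made up; with the real module you would build $rtr via $rt->add($_) for each keyword and $rt->regexp()):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for Regexp::Trie's generated pattern.
my @keywords = qw(Foo Bar);
my $rtr = join '|', map { quotemeta } @keywords;

# Note the substitution also catches keywords embedded in larger
# "words", which is what the OP needs.
my $text = "xxFooyy and Bar.Foo";
$text =~ s/($rtr)/prefix_$1/g;

print $text, "\n";   # xxprefix_Fooyy and prefix_Bar.prefix_Foo
```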

At 10:02 PM 2/5/2011 -0600, Ted Zlatanov wrote:
On Sat, 05 Feb 2011 18:27:13 -0500 Charlie <[email protected]> wrote:

C> The sample program below runs in 00:09:04 on 1.15GB  (1024 copies of
C> Moby Dick).   Replacing the hard-coded map with 2 entries with 6000
C> words taken from the text (randomly selected, unique, >5 chars)  runs
C> in 00:09:17.   I.e. the map lookup is trivial and has little impact on
C> the overall performance.   A trie might do just a bit better than a
C> hash table, but you are optimizing the wrong part of the program.

C> my %keywords = ( Foo => 1, Bar => 2 );

C> while ( <> )
C> {
C>    my $line = '';
C>    my @tokens = split /\s/;
C>    for my $tkn ( @tokens )
C>    {
C>        if ( '' ne $line ) {
C>          $line .= ' ';
C>        }
C>        if ( exists $keywords{$tkn} ) {
C>          $line .= 'prefix_';
C>        }
C>        $line .= $tkn;
C>    }
C>    print $line, "\n";
C> }

You assumed that \s will delimit the tokens.  That's not the case (see
the original message, the interesting data can occur anywhere).  So you
can't tokenize and do a simple hash lookup.  If you benchmark 6000
index() calls vs. 2 index() calls you'll see why a trie makes sense.
You may benefit if you tokenize, which I mentioned in my message:

TZ> If you know a boundary (e.g. newline or space or \b or \W) which
TZ> always demarcates your strings, you can split your input on that and
TZ> feed the pieces to Tree::Trie.

There is often a significant benefit to optimizing this kind of work in
C, as Text::Match::FastAlternatives does, because you may be able to
keep the code in the tight inner loop from busting the instruction
cache.  The performance benefits are huge.

I attach a benchmark of trie vs. fastmatch.  I commented out the index()
benchmark because it was ridiculously slow, on the order of 23 per
second on a fairly recent CPU.  Text::Match::FastAlternatives seems the
best choice, as advertised.

Ted

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;
use Benchmark qw/cmpthese/;
use Text::Match::FastAlternatives;
use Regexp::Trie;

my $text = 'abcd ' x 100;

my @words = 'a' .. 'hzz';

my $fast = Text::Match::FastAlternatives->new(@words);

my $rt = Regexp::Trie->new;
$rt->add($_) foreach @words;
my $rtr = $rt->regexp();
my $word_count = scalar @words;

cmpthese(5000000,
         {
          # "$word_count keywords, index()" => sub
          # {
          #  my $found;
          #  foreach (@words)
          #  {
          #   $found += index $text, $_, 0;
          #  }
          # },

          "$word_count keywords, fastmatch" => sub
          {
           return $fast->match($text)
          },

          "$word_count keywords, trie" => sub
          {
           return ($text =~ m/$rtr/);
          },
         });
__DATA__
                              Rate      6110 keywords, trie 6110 keywords, fastmatch
6110 keywords, trie       638570/s                       --                     -73%
6110 keywords, fastmatch 2325581/s                     264%                       --

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
