Given how you frame the problem, the hash lookup isn't even an
option! No question, 6000+ string searches will be slow compared to a
trie. Given the varying requirements we all encounter day to day, I
think this is an interesting exercise. Thanks for sharing these modules, Ted.
The OP indicated that the text can be tokenized:
KS> Unfortunately, my names can be embedded in larger "words" of the input
KS> text, as long as they are delimited by certain punctuation.
Greg L. pointed out some tools to assist with that. I was just pointing
out that it is worth his while to make that work. If the input is line
oriented, a "real" parser is unlikely to increase the cost by much, but
it might be more complicated than a simple split() expression.
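For example, a split that treats punctuation as well as whitespace as a
delimiter, while keeping the delimiters so the line can be rebuilt, might
look something like the sketch below. The character class is a guess at the
OP's punctuation, not his actual spec, and Foo/Bar stand in for the real list:

#!/usr/bin/perl
use strict;
use warnings;

my %keywords = ( Foo => 1, Bar => 2 );

while ( my $line = <> )
{
    chomp $line;
    # Capture the delimiters so join('') can reassemble the original line.
    my @pieces = split /([\s.,;:\/-]+)/, $line;
    for my $piece (@pieces)
    {
        $piece = "prefix_$piece" if exists $keywords{$piece};
    }
    print join('', @pieces), "\n";
}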
To get a comparison, I tried Regexp::Trie on my Moby Dick dictionary and
text and it definitely gets the job done in a reasonable time: 22:30 (vs
09:17 for split+hash). Good to know.
Despite its better search performance, Text::Match::FastAlternatives
won't be usable here: it only tells you whether the text matches, so it
won't help you mangle the output. For the regex to actually do the
rewriting, you need to use it in a substitution, like so: s/($rtr)/prefix_$1/g.
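A minimal sketch of that substitution approach (the two-word list stands in
for the real 6000-name dictionary, and the streaming loop is just for
illustration):

use strict;
use warnings;
use Regexp::Trie;

my $rt = Regexp::Trie->new;
$rt->add($_) for qw(Foo Bar);   # the real list has ~6000 names
my $rtr = $rt->regexp;          # one compiled alternation regex

while ( my $line = <> )
{
    # Rewrite every occurrence in place; this is the step a boolean
    # matcher like Text::Match::FastAlternatives can't do for you.
    $line =~ s/($rtr)/prefix_$1/g;
    print $line;
}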
At 10:02 PM 2/5/2011 -0600, Ted Zlatanov wrote:
On Sat, 05 Feb 2011 18:27:13 -0500 Charlie <[email protected]> wrote:
C> The sample program below runs in 00:09:04 on 1.15GB (1024 copies of
C> Moby Dick). Replacing the hard-coded map with 2 entries with 6000
C> words taken from the text (randomly selected, unique, >5 chars) runs
C> in 00:09:17. I.e. the map lookup is trivial and has little impact on
C> the overall performance. A trie might do just a bit better than a
C> hash table, but you are optimizing the wrong part of the program.
C> my %keywords = ( Foo => 1, Bar => 2 );
C> while ( <> )
C> {
C> my $line = '';
C> my @tokens = split /\s/;
C> for my $tkn ( @tokens )
C> {
C> if ( '' ne $line ) {
C> $line .= ' ';
C> }
C> if ( exists $keywords{$tkn} ) {
C> $line .= 'prefix_';
C> }
C> $line .= $tkn;
C> }
C> print $line, "\n";
C> }
You assumed that \s will delimit the tokens. That's not the case (see
the original message; the interesting data can occur anywhere), so you
can't tokenize and do a simple hash lookup. If you benchmark 6000
index() calls vs. 2 index() calls, you'll see why a trie makes sense.
You may benefit if you tokenize, which I mentioned in my message:
TZ> If you know a boundary (e.g. newline or space or \b or \W) which
TZ> always demarcates your strings, you can split your input on that and
TZ> feed the pieces to Tree::Trie.
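A rough sketch of that tokenize-and-lookup approach, assuming Tree::Trie's
deepsearch('exact') lookup mode and \W as the boundary (both assumptions for
illustration, not the OP's actual spec):

use strict;
use warnings;
use Tree::Trie;

my $trie = Tree::Trie->new;
$trie->add(qw(Foo Bar));        # the real 6000-word list goes here
$trie->deepsearch('exact');     # lookup() returns the word only on an exact match

while ( my $line = <> )
{
    chomp $line;
    # Split on non-word characters, the assumed boundary.
    for my $piece ( split /\W+/, $line )
    {
        next unless length $piece;
        print "prefix_$piece\n" if defined $trie->lookup($piece);
    }
}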
There is often a significant benefit to optimizing this kind of work in
C, as Text::Match::FastAlternatives does, because you may be able to
keep the code in the tight inner loop from busting the instruction
cache. The performance benefits are huge.
I attach a benchmark of trie vs. fastmatch. I commented out the index()
benchmark because it was ridiculously slow, on the order of 23 per
second on a fairly recent CPU. Text::Match::FastAlternatives seems the
best choice, as advertised.
Ted
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use Benchmark qw/cmpthese/;
use Text::Match::FastAlternatives;
use Regexp::Trie;
my $text = 'abcd ' x 100;   # 500-character haystack
my @words = 'a' .. 'hzz';   # 6110 keywords via Perl's magic string increment
my $fast = Text::Match::FastAlternatives->new(@words);
my $rt = Regexp::Trie->new;
$rt->add($_) foreach @words;
my $rtr = $rt->regexp();    # single alternation regex built from the trie
my $word_count = scalar @words;
cmpthese(5000000,
         {
          # "$word_count keywords, index()" => sub
          # {
          #   my $found;
          #   foreach (@words)
          #   {
          #     $found += index $text, $_, 0;
          #   }
          # },
          "$word_count keywords, fastmatch" => sub
          {
            return $fast->match($text)
          },
          "$word_count keywords, trie" => sub
          {
            return ($text =~ m/$rtr/);
          },
         });
__DATA__
                               Rate 6110 keywords, trie 6110 keywords, fastmatch
6110 keywords, trie        638570/s                  --                     -73%
6110 keywords, fastmatch  2325581/s                264%                       --
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm