I was checking out the documentation for Parse::Gnaw. I already have a quoted
alternation subroutine called qa. It takes a list of literals and creates an
alternation of all of them. I could simply change the code under the hood
to do a tree search, and then looking for a gigantic list of very similar
words wouldn't cost any more than a short list.
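For anyone who wants to play with the idea before it lands in Parse::Gnaw,
here is a minimal sketch (not Parse::Gnaw's actual qa() implementation) that
collapses a word list into a trie-shaped alternation using Regexp::Trie,
which comes up later in this thread:

#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Trie;

# Hypothetical trie-backed stand-in for qa(): takes a list of literals and
# returns one compiled regex whose internal structure is a trie, so thousands
# of similar words cost about the same to match as a handful.
sub qa_trie {
    my @literals = @_;
    my $rt = Regexp::Trie->new;
    $rt->add($_) for @literals;
    return $rt->regexp;
}

my $re = qa_trie(qw(foobar foobaz fooqux));
print "hit\n" if 'xxxfoobazxxx' =~ $re;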
I was one of the lucky few to buy the new Intel chip with the bug in the I/O
port, so I am waiting for the replacement desktop. Until then I don't have any
way to work on Parse::Gnaw to add this feature.
But I will put it on my to-do list. The more I think about this, the more I
see the potential boost it would provide, especially for bio searches where
someone is looking for one of several long but similar chains of DNA.
Lastly, folks keep bringing up tokenizing. Parse::Gnaw and Parse::RecDescent
both do not tokenize. Tokenizing is faster, but not tokenizing gives your
grammars more flexibility, I think.
-----Original message-----
From: Charlie <[email protected]>
To: Ted Zlatanov <[email protected]>
Cc: [email protected]
Sent: Sun, Feb 6, 2011 16:49:56 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency
Given how you frame the problem, the hash lookup isn't even an option!
No question, 6000+ string searches will be slow vs. a trie. Given the
varying requirements we all encounter day-to-day, I think this is an
interesting exercise. Thanks for sharing these modules, Ted.
The OP indicated that the text can be tokenized:
KS> Unfortunately, my names can be embedded in larger "words" of the input
KS> text, as long as they are delimited by certain punctuation.
Greg L. pointed out some tools to assist with that. I was just pointing
out that it is worth his while to make that work. If input is line oriented, a
"real" parser is unlikely to increase the cost by too much. But it might
be more complicated than a simple split() expression.
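As a rough illustration of what that split might look like when the
delimiters are punctuation rather than whitespace (the delimiter class and
keyword list here are assumptions, not the OP's actual rules):

my %keywords = map { $_ => 1 } qw(Foo Bar);
while ( my $line = <> )
{
    # A capturing split keeps the punctuation/whitespace delimiters,
    # so the line can be reassembled verbatim after tagging.
    my @pieces = split /([\s.,;:!?"()-]+)/, $line;
    for my $p ( @pieces )
    {
        $p = 'prefix_' . $p if exists $keywords{$p};
    }
    print @pieces;
}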
To get a comparison, I tried Regexp::Trie on my Moby Dick dictionary and
text and it definitely gets the job done in a reasonable time: 22:30 (vs
09:17 for split+hash). Good to know.
Despite its better search performance, Text::Match::FastAlternatives
will not be usable, since it won't help you mangle the output. For the regex
to actually work, you need to use it like so: s/($rtr)/prefix_$1/g.
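In context, that substitution might look something like the following sketch
(the word list is made up; $rtr is the Regexp::Trie pattern, built the same
way as in Ted's script below):

use strict;
use warnings;
use Regexp::Trie;

my $rt = Regexp::Trie->new;
$rt->add($_) for qw(whale harpoon Ahab);
my $rtr = $rt->regexp;

while ( my $line = <> )
{
    # Tag every match in place; the surrounding text is left untouched.
    $line =~ s/($rtr)/prefix_$1/g;
    print $line;
}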
At 10:02 PM 2/5/2011 -0600, Ted Zlatanov wrote:
On Sat, 05 Feb 2011 18:27:13 -0500 Charlie <[email protected]> wrote:
C> The sample program below runs in 00:09:04 on 1.15GB (1024 copies of
C> Moby Dick). Replacing the hard-coded map with 2 entries with 6000
C> words taken from the text (randomly selected, unique, >5 chars) runs
C> in 00:09:17. I.e. the map lookup is trivial and has little impact on
C> the overall performance. A trie might do just a bit better than a
C> hash table, but you are optimizing the wrong part of the program.
C> my %keywords = ( Foo => 1, Bar => 2 );
C> while ( <> )
C> {
C>     my $line = '';
C>     my @tokens = split /\s/;
C>     for my $tkn ( @tokens )
C>     {
C>         if ( '' ne $line ) {
C>             $line .= ' ';
C>         }
C>         if ( exists $keywords{$tkn} ) {
C>             $line .= 'prefix_';
C>         }
C>         $line .= $tkn;
C>     }
C>     print $line, "\n";
C> }
You assumed that \s will delimit the tokens. That's not the case (see
the original message, the interesting data can occur anywhere). So you
can't tokenize and do a simple hash lookup. If you benchmark 6000
index() calls vs. 2 index() calls you'll see why a trie makes sense.
You may benefit if you tokenize, which I mentioned in my message:
TZ> If you know a boundary (e.g. newline or space or \b or \W) which
TZ> always demarcates your strings, you can split your input on that and
TZ> feed the pieces to Tree::Trie.
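For completeness, a small sketch of that split-then-trie approach with
Tree::Trie (the \W+ boundary and the word list are only placeholders):

use strict;
use warnings;
use Tree::Trie;

my $trie = Tree::Trie->new;
$trie->add(qw(foo foobar baz));

while (my $line = <>) {
    for my $piece (grep { length } split /\W+/, $line) {
        # lookup() returns every stored word that begins with $piece,
        # so an exact hit shows up as $piece itself in that list.
        print "$piece\n" if grep { $_ eq $piece } $trie->lookup($piece);
    }
}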
There is often a significant benefit to optimizing this kind of work in
C, as Text::Match::FastAlternatives does, because you may be able to
keep the code in the tight inner loop from busting the instruction
cache. The performance benefits are huge.
I attach a benchmark of trie vs. fastmatch. I commented out the index()
benchmark because it was ridiculously slow, on the order of 23 per
second on a fairly recent CPU. Text::Match::FastAlternatives seems the
best choice, as advertised.
Ted
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use Benchmark qw/cmpthese/;
use Text::Match::FastAlternatives;
use Regexp::Trie;
my $text = 'abcd ' x 100;
my @words = 'a' .. 'hzz';
my $fast = Text::Match::FastAlternatives->new(@words);
my $rt = Regexp::Trie->new;
$rt->add($_) foreach @words;
my $rtr = $rt->regexp();
my $word_count = scalar @words;
cmpthese(5000000,
    {
        # "$word_count keywords, index()" => sub
        # {
        #     my $found;
        #     foreach (@words)
        #     {
        #         $found += index $text, $_, 0;
        #     }
        # },
        "$word_count keywords, fastmatch" => sub
        {
            return $fast->match($text)
        },
        "$word_count keywords, trie" => sub
        {
            return ($text =~ m/$rtr/);
        },
    });
__DATA__
                               Rate 6110 keywords, trie 6110 keywords, fastmatch
6110 keywords, trie        638570/s                  --                     -73%
6110 keywords, fastmatch  2325581/s                264%                       --
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm