I was checking out the documentation for Parse::Gnaw. I already have a quoted
alternation subroutine called qa. It takes a list of literals and creates an
alternation of all of them. I could simply change the code under the hood
to do a tree search, and then looking for a gigantic list of very similar
words wouldn't cost any more than a short list.
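For anyone who wants to play with the idea before it lands in Parse::Gnaw,
here is a minimal sketch (not Parse::Gnaw's actual qa() implementation) that
collapses a word list into a trie-shaped alternation using Regexp::Trie,
which comes up later in this thread:

#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Trie;

# Hypothetical trie-backed stand-in for qa(): takes a list of literals and
# returns one compiled regex whose internal structure is a trie, so thousands
# of similar words cost about the same to match as a handful.
sub qa_trie {
    my @literals = @_;
    my $rt = Regexp::Trie->new;
    $rt->add($_) for @literals;
    return $rt->regexp;
}

my $re = qa_trie(qw(foobar foobaz fooqux));
print "hit\n" if 'xxxfoobazxxx' =~ $re;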
I was one of the lucky few to buy the new Intel chip with the bug in the I/O
port, so I am waiting for the replacement desktop. Until then I don't have any
way to work on Parse::Gnaw to add this feature.
But I will put it on my to-do list. The more I think about this, the more I
see the potential boost it would provide, especially for bio searches where
someone is looking for one of several long but similar chains of DNA.
Lastly, folks keep bringing up tokenizing. Parse::Gnaw and Parse::RecDescent
both do not tokenize. Tokenizing is faster, but not tokenizing gives your
grammars more flexibility, I think.
-----Original message-----
From: Charlie <[email protected]>
To: Ted Zlatanov <[email protected]>
Cc: [email protected]
Sent: Sun, Feb 6, 2011 16:49:56 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency
Given how you frame the problem, the hash lookup isn't even an option!
No question, 6000+ string searches will be slow vs. a trie. Given the
varying requirements we all encounter day-to-day, I think this is an
interesting exercise. Thanks for sharing these modules, Ted.
The OP indicated that the text can be tokenized:
KS> Unfortunately, my names can be embedded in larger "words" of the input
KS> text, as long as they are delimited by certain punctuation.
Greg L. pointed out some tools to assist with that. I was just pointing
out that it is worth his while to make that work. If input is line oriented, a
"real" parser is unlikely to increase the cost by too much. But it might
be more complicated than a simple split() expression.
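As a rough illustration of what that split might look like when the
delimiters are punctuation rather than whitespace (the delimiter class and
keyword list here are assumptions, not the OP's actual rules):

my %keywords = map { $_ => 1 } qw(Foo Bar);
while ( my $line = <> )
{
    # A capturing split keeps the punctuation/whitespace delimiters,
    # so the line can be reassembled verbatim after tagging.
    my @pieces = split /([\s.,;:!?"()-]+)/, $line;
    for my $p ( @pieces )
    {
        $p = 'prefix_' . $p if exists $keywords{$p};
    }
    print @pieces;
}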
To get a comparison, I tried Regexp::Trie on my Moby Dick dictionary and
text and it definitely gets the job done in a reasonable time: 22:30 (vs
09:17 for split+hash). Good to know.
Despite its better search performance, Text::Match::FastAlternatives
will not be usable, since it won't help you mangle the output. For the regex
to actually work, you need to use it like so: s/($rtr)/prefix_$1/g.
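In context, that substitution might look something like the following sketch
(the word list is made up; $rtr is the Regexp::Trie pattern, built the same
way as in Ted's script below):

use strict;
use warnings;
use Regexp::Trie;

my $rt = Regexp::Trie->new;
$rt->add($_) for qw(whale harpoon Ahab);
my $rtr = $rt->regexp;

while ( my $line = <> )
{
    # Tag every match in place; the surrounding text is left untouched.
    $line =~ s/($rtr)/prefix_$1/g;
    print $line;
}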
At 10:02 PM 2/5/2011 -0600, Ted Zlatanov wrote:
On Sat, 05 Feb 2011 18:27:13 -0500 Charlie <[email protected]> wrote:
C> The sample program below runs in 00:09:04 on 1.15GB (1024 copies of
C> Moby Dick). Replacing the hard-coded map with 2 entries with 6000
C> words taken from the text (randomly selected, unique, >5 chars) runs
C> in 00:09:17. I.e. the map lookup is trivial and has little impact on
C> the overall performance. A trie might do just a bit better than a
C> hash table, but you are optimizing the wrong part of the program.
C> my %keywords = ( Foo => 1, Bar => 2 );
C> while ( <> )
C> {
C>     my $line = '';
C>     my @tokens = split /\s/;
C>     for my $tkn ( @tokens )
C>     {
C>         if ( '' ne $line ) {
C>             $line .= ' ';
C>         }
C>         if ( exists $keywords{$tkn} ) {
C>             $line .= 'prefix_';
C>         }
C>         $line .= $tkn;
C>     }
C>     print $line, "\n";
C> }
You assumed that \s will delimit the tokens. That's not the case (see
the original message, the interesting data can occur anywhere). So you
can't tokenize and do a simple hash lookup. If you benchmark 6000
index() calls vs. 2 index() calls you'll see why a trie makes sense.
You may benefit if you tokenize, which I mentioned in my message:
TZ> If you know a boundary (e.g. newline or space or \b or \W) which
TZ> always demarcates your strings, you can split your input on that and
TZ> feed the pieces to Tree::Trie.
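For completeness, a small sketch of that split-then-trie approach with
Tree::Trie (the \W+ boundary and the word list are only placeholders):

use strict;
use warnings;
use Tree::Trie;

my $trie = Tree::Trie->new;
$trie->add(qw(foo foobar baz));

while (my $line = <>) {
    for my $piece (grep { length } split /\W+/, $line) {
        # lookup() returns every stored word that begins with $piece,
        # so an exact hit shows up as $piece itself in that list.
        print "$piece\n" if grep { $_ eq $piece } $trie->lookup($piece);
    }
}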
There is often a significant benefit to optimizing this kind of work in
C, as Text::Match::FastAlternatives does, because you may be able to
keep the code in the tight inner loop from busting the instruction
cache. The performance benefits are huge.
I attach a benchmark of trie vs. fastmatch. I commented out the index()
benchmark because it was ridiculously slow, on the order of 23 per
second on a fairly recent CPU. Text::Match::FastAlternatives seems the
best choice, as advertised.
Ted
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
use Benchmark qw/cmpthese/;
use Text::Match::FastAlternatives;
use Regexp::Trie;
my $text = 'abcd ' x 100;
my @words = 'a' .. 'hzz';
my $fast = Text::Match::FastAlternatives->new(@words);
my $rt = Regexp::Trie->new;
$rt->add($_) foreach @words;
my $rtr = $rt->regexp();
my $word_count = scalar @words;
cmpthese(5000000,
    {
        # "$word_count keywords, index()" => sub
        # {
        #     my $found;
        #     foreach (@words)
        #     {
        #         $found += index $text, $_, 0;
        #     }
        # },
        "$word_count keywords, fastmatch" => sub
        {
            return $fast->match($text)
        },
        "$word_count keywords, trie" => sub
        {
            return ($text =~ m/$rtr/);
        },
    });
__DATA__
                               Rate 6110 keywords, trie 6110 keywords, fastmatch
6110 keywords, trie        638570/s                  --                     -73%
6110 keywords, fastmatch  2325581/s                264%                       --
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm