if the alternation were implemented as a tree lookup then looking for a big list of similar alternations (last names in a phone book) would scale.

hm. I think I just found something to put on my to-do list for parse gnaw. should be fairly easy. pass in a list of identifiers, convert it to a tree search, return the coderef.

Is there a name for this kind of behavior? wondering what i would call the subroutine.

Greg

-----Original message-----
From: Charlie <[email protected]>
To: [email protected]
Sent: Sat, Feb 5, 2011 23:27:13 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency

Short answer, no, Perl regex will not build an optimal lookup of a token into your set of 6000 names. In general, if speed is the issue, do not use regex. It does not scale.

Also, be clear on the 2 problems at hand: 1) tokenizing 1GB of input text and 2) adding a prefix to identified "keywords".

Tokenizing, reconstructing and printing 1GB of text should be measured in minutes, not hours. Doing something with each token may take a bit longer. But the problem is a simple keyword lookup. Nothing more. Building a simple hash keyed by your keywords would be sufficient.

The sample program below runs in 00:09:04 on 1.15GB (1024 copies of Moby Dick). Replacing the hard-coded map with 2 entries with 6000 words taken from the text (randomly selected, unique, >5 chars) runs in 00:09:17. I.e. the map lookup is trivial and has little impact on the overall performance. A trie might do just a bit better than a hash table, but you are optimizing the wrong part of the program.

my %keywords = ( Foo => 1, Bar => 2 );

while ( <> )
{
   my $line = '';
   my @tokens = split /\s/;
   for my $tkn ( @tokens )
   {
       if ( '' ne $line ) {
         $line .= ' ';
       }
       if ( exists $keywords{$tkn} ) {
         $line .= 'prefix_';
       }
       $line .= $tkn;
   }
   print $line, "\n";
}


At 06:10 PM 2/4/2011 -0600, you wrote:
On Fri, 4 Feb 2011 18:43:57 -0500 Kripa Sundar <[email protected]> wrote:

KS> I have a 900 Meg text file, containing random text.  I also have a list
KS> of 6000 names (alphanumeric strings) that occur in the random text.
KS> I need to tag a prefix on to each occurrence of each of these 6000
KS> names.

KS> My premise:
KS> I believe a regex would give the simplest and most efficient algorithm.
KS> If I am mistaken, I would be happy to learn.

KS> Solution attempt:
KS> I built a large-but-simple regex, consisting of all the names in
KS> alternation.  I applied this regex to each input line.

KS> My code:

KS>   1: my @names = [...];  # my 6000 names.
KS>   2: my $regex = join "|", @names;
KS>   3: $regex = qr/\b($regex)\b/;
KS>   4:
KS>   5: # Read the input, and write out to all the copies simultaneously.
KS>   6: while (<>) {
KS>   7:     s/$regex/prefix_$1/g;
KS>   8: }

On CPAN you can find Regex::Trie which builds this regex in a more
optimized way
(http://search.cpan.org/~dankogai/Regexp-Trie-0.02/lib/Regexp/Trie.pm).
This is the most general solution.

There's also Tree::Trie
(http://search.cpan.org/~avif/Tree-Trie-1.5/Trie.pm) which is for
general text matching.  If you know a boundary (e.g. newline or space or
\b or \W) which always demarcates your strings, you can split your input
on that and feed the pieces to Tree::Trie.  You have to benchmark to
find out if that's faster that Regex::Trie (which claims to be built
for speed).  This is a more specific solution than Regex::Trie.

I also found
http://aaroncrane.co.uk/2008/05/text_match_fastalternatives/ which
benchmarks a C implementation called Text::Match::FastAlternatives to be
very fast.  Incidentally it claims that Perl 5.10 has specific
optimizations for your use case and that Text::Match::FastAlternatives
beats all the alternatives in speed.  Text::Match::FastAlternatives only
handles ASCII characters and is the most single-purpose solution of the
three.  So try it if it fits, it's also on CPAN :)

If you decide on one in particular, please let us know what you found.

Ted

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm


_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm


_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

Reply via email to