[Boston.pm] Q: giant-but-simple regex efficiency

Kripa Sundar Fri, 04 Feb 2011 15:44:29 -0800

Hi folks,

Problem:
I have a 900 MB text file containing random text.  I also have a list
of 6000 names (alphanumeric strings) that occur in the random text.
I need to add a prefix to each occurrence of each of these 6000
names.


My premise:
I believe a regex would give the simplest and most efficient algorithm.
If I am mistaken, I would be happy to learn.

Solution attempt:
I built a large-but-simple regex, consisting of all the names in
alternation.  I applied this regex to each input line.

My code:

  1: my @names = (...);  # my 6000 names.
  2: my $regex = join "|", @names;
  3: $regex = qr/\b($regex)\b/;
  4: 
  5: # Read the input line by line, tag each name, and write the line out.
  6: while (<>) {
  7:     s/$regex/prefix_$1/g;
  8:     print;
  9: }

Turnaround time:
My seat-of-the-pants guess was that my code would run for 4-5 hours,
on a 2.4GHz AMD Opteron CPU.

But I found that I was pushing through less than 1% of the input per
hour.  So, my full run would have taken >100 hours.

Seeing this poor throughput, I thought sorting the names might help
the Perl regex compiler produce more efficient code.
So I changed line 2 to:

  2: my $regex = join "|", sort @names;

That was marginally faster, but I still estimate that my run would
have taken 100 hours or more.

Is there a simple efficient solution that I am overlooking?
Is there any obvious inefficiency in my approach?
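
One alternative that may be worth comparing against the alternation,
sketched here under the assumption that every name consists only of
\w characters (so that \b(\w+)\b finds exactly the same whole words):
match each word generically and look it up in a hash, so that each
word costs one hash lookup rather than a trip through a 6000-way
alternation.  The three names, the sample lines, and the helper
tag_names are hypothetical stand-ins, and this has not been timed
against the 900 MB input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 6000 names (assumed to be \w-only strings).
my @names = qw(alpha beta42 gamma);

# Hash of names: membership is tested with a single lookup per word.
my %is_name = map { $_ => 1 } @names;

# Prefix every whole word that is one of the names; leave other words alone.
sub tag_names {
    my ($line) = @_;
    $line =~ s/\b(\w+)\b/exists($is_name{$1}) ? "prefix_$1" : $1/ge;
    return $line;
}

# Hypothetical sample input; in a real run the same call would sit
# inside the while (<>) loop.
for my $line ("alpha met beta42 and alphabet\n", "no names here\n") {
    print tag_names($line);
}
```

The first sample line comes out as "prefix_alpha met prefix_beta42 and
alphabet": "alphabet" is left alone because the whole word is not in
the hash, which mirrors what the \b anchors do in the alternation.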

peace,          || Finding gifts that do not harm:
--{kr.pA}       || http://www.dailygood.org/more.php?n=3159
-- 
It might look like I'm idle, but at the cellular level I'm really quite busy.

