Re: (no subject)

Steve Dawson Tue, 04 Oct 2005 04:22:19 -0700

Bill Luebkert wrote:

Foo Ji-Haw wrote:

I think the tricky part is that spaces may appear within the "..." field. In
that case the pattern may well be:
/^"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+$/



You can just adjust the split:

foreach (@lines) {
        my @flds = split /"\s+"|^"|"$/;     # remove "s - column0 will now be empty
        if ($flds[6] eq 'pink' and $flds[8] eq 'blue') {
                print "found one: $_\n";
        }
}
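
To see why the split above tests $flds[6] and $flds[8] rather than [5] and [7], here is a minimal sketch run against a made-up sample line (the field values are my own invention, not taken from the original data). Because the ^" alternative matches at the very start of the string, split returns an empty leading field and the real data starts at index 1.

```perl
use strict;
use warnings;

# Hypothetical sample line: twelve quoted fields, "pink" sixth and "blue" eighth.
my $line = join ' ', map { qq("$_") }
    'a b', 'c', 'd', 'e', 'f', 'pink', 'g', 'blue', 'h', 'i', 'j', 'k';

# ^" matches at offset 0, so the first returned field is empty.
# The trailing "$ match would leave an empty last field, which split drops.
my @flds = split /"\s+"|^"|"$/, $line;

print scalar(@flds), " fields\n";
print "flds[0] is empty\n" if $flds[0] eq '';
print "found one: $line\n" if $flds[6] eq 'pink' and $flds[8] eq 'blue';
```

The first field deliberately contains a space ("a b") to show that embedded spaces survive this split.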

While Bill and I both suggested the split (me because I thought it was faster and more clearly expressed), I was bored and decided to work out just how much slower the regex would be. Having now benchmarked 8 variations, I have to say "I was wrong".

On a 6MB test file, the regex ran in 2.48 secs while my split ran in 6.19 secs. Here are the variations I tried and the timings...

1) My version of the split: *6.19s*
@columns = split /\"\s+\"/

2) Bill's most recent version: *9.34s*
@columns = split /"\s+"|^"|"$/

3) Ji-Haw's original regex: *2.48s*
/^"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"/



4) Ji-Haw's with a stored pattern to make it vaguely more readable: *3.2s*
my $field = '"([^"]+)"';
/^$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field/;

5) With //o, which didn't make any difference on 6MB but ran about 30% faster on a 270MB test file. Based on a previous list discussion, I'm not sure why.



6) A global pattern match which was more readable:  *4.95s*
my $field = '"([^"]+)"';
my($pink,$blue) = (/$field/g)[5,7];

7) To try to increase the speed of the matches, I decided to weed out lines that don't contain both pink and blue: *1.2s*

next unless /blue/;
next unless /pink/;
/^"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"/o;


8) Encouraged by that effort I tried the previous in a single match: *1.11s*
/^"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"(pink)"\s+"[^"]+"\s+"(blue)"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"/
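
For anyone who wants to repeat the comparison, here is a rough harness using the core Benchmark module. It is only a sketch: the sample lines, the sub names and the iteration count are my own assumptions, and it runs against a small in-memory list rather than the 6MB file, so the numbers it prints will not match the timings above. The 12-field patterns are built with join and the x operator and compiled once with qr// instead of being typed out.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical test data: 1 matching line in 100, roughly a 1% hit rate.
sub make_line { join ' ', map { qq("$_") } @_ }
my $hit  = make_line('a b', 'c', 'd', 'e', 'f', 'pink', 'g', 'blue',  'h', 'i', 'j', 'k');
my $miss = make_line('a b', 'c', 'd', 'e', 'f', 'red',  'g', 'green', 'h', 'i', 'j', 'k');
my @lines = (($miss) x 99, $hit);

# Build the patterns from their parts and compile each one once.
my $field   = '"([^"]+)"';                     # capturing field (versions 3, 4, 6)
my $any     = '"[^"]+"';                       # non-capturing field (version 8)
my $full_p  = '^' . join('\s+', ($field) x 12);
my $v8_p    = '^' . join('\s+', ($any) x 5, '"pink"', $any, '"blue"', ($any) x 4);
my $full_re = qr/$full_p/;
my $v8_re   = qr/$v8_p/;

# Each sub counts the lines whose 6th field is pink and 8th is blue.
my %count_with = (
    split_v2 => sub {
        my $n = 0;
        for (@lines) {
            my @f = split /"\s+"|^"|"$/;       # leading empty field, so 6 and 8
            $n++ if $f[6] eq 'pink' and $f[8] eq 'blue';
        }
        return $n;
    },
    regex_v3 => sub {
        my $n = 0;
        for (@lines) {
            my @f = /$full_re/ or next;        # captures are 0-based, so 5 and 7
            $n++ if $f[5] eq 'pink' and $f[7] eq 'blue';
        }
        return $n;
    },
    global_v6 => sub {
        my $n = 0;
        for (@lines) {
            my ($six, $eight) = (/$field/g)[5, 7];
            next unless defined $six and defined $eight;
            $n++ if $six eq 'pink' and $eight eq 'blue';
        }
        return $n;
    },
    single_v8 => sub { return scalar grep { /$v8_re/ } @lines },
);

# Sanity check first: every variant should find the same lines.
my %hits = map { $_ => $count_with{$_}->() } keys %count_with;
printf "%-10s %d hit(s)\n", $_, $hits{$_} for sort keys %hits;

# Then time them; 200 passes is an arbitrary choice.
timethese(200, \%count_with);
```

Checking that all variants agree before timing them is worth the extra lines, since the split version and the regex versions index the fields differently.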

Conclusions:

A) I have too much time on my hands. The benchmarks on a 270MB file were taking 5 mins per pass. The 6MB performance was almost always proportional.

B) regex appears to be significantly faster than split.

C) I believe the split is the most readable and for trivial to medium tasks I'd still use it so the next person can see what I've done.

D) I like version 6, the global regex, because it's short, clean and a fair compromise on performance.

E) If I was running through a 10GB file, I'd be running some variation on v8 and using fast filtering to eliminate the bulk of the file.

F) Bearing in mind that my test file only had blue|pink on 1% of its entries, your mileage may vary.



# while reads one line at a time; for (<FILE>) would load the whole file first
while (<FILE>) {
    if (/^"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"pink"\s+"[^"]+"\s+"blue"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"/) {
        # do stuff here...
    }
}
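
The "fast filtering" idea from conclusion E could also be done with plain substring checks before the positional regex. The following is a sketch only: count_pink_blue is a hypothetical helper name, the three data lines are invented, and I have not measured whether index() actually beats the /blue/ and /pink/ tests from version 7, so benchmark it on your own data before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines whose 6th field is pink and 8th is blue.
sub count_pink_blue {
    my ($fh) = @_;
    my $any = '"[^"]+"';
    my $pat = '^' . join('\s+', ($any) x 5, '"pink"', $any, '"blue"', ($any) x 4);
    my $re  = qr/$pat/;

    my $hits = 0;
    while (my $line = <$fh>) {                 # one line at a time
        next if index($line, '"pink"') < 0;    # literal checks, no regex engine
        next if index($line, '"blue"') < 0;
        $hits++ if $line =~ $re;               # positional check on candidates only
    }
    return $hits;
}

# In-memory filehandle standing in for the real data file.
my $data = join '', map { "$_\n" }
    '"a b" "c" "d" "e" "f" "pink" "g" "blue" "h" "i" "j" "k"',
    '"a b" "c" "d" "e" "f" "blue" "g" "pink" "h" "i" "j" "k"',
    '"a b" "c" "d" "e" "f" "red" "g" "green" "h" "i" "j" "k"';
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $hits = count_pink_blue($fh);
close $fh;

print "$hits matching line(s)\n";
```

The second sample line contains both colours but in the wrong columns, so it passes the prefilter and is then rejected by the full pattern.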
