$Bill Luebkert wrote:
Foo Ji-Haw wrote:
I think the tricky part is that spaces may appear within the "..." field. In
which case the pattern may well be:
/^"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+$/


You can just adjust the split:

foreach (@lines) {
        my @flds = split /"\s+"|^"|"$/;     # remove "s - column0 will now be empty
        if ($flds[6] eq 'pink' and $flds[8] eq 'blue') {
                print "found one: $_\n";
        }
}
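
To see why the indices shift by one, here is a minimal sketch of that split run on a single made-up line. The twelve field values are hypothetical, not taken from the original data.

```perl
use strict;
use warnings;

# Hypothetical sample line: twelve quoted fields separated by single spaces.
my $line = '"' . join('" "', qw(a b c d e pink g blue i j k l)) . '"';

# Split on quote-whitespace-quote, and also on the leading and trailing quote.
my @flds = split /"\s+"|^"|"$/, $line;

# The leading quote matches at the very start, so $flds[0] is an empty string
# and the sixth field ends up at index 6. The trailing empty field is dropped.
printf "%d fields, flds[6]=%s, flds[8]=%s\n", scalar @flds, $flds[6], $flds[8];
```

That empty leading column is why the test uses indices 6 and 8 rather than 5 and 7.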

While Bill and I both suggested the split (me because I thought it was faster and more clearly expressed), I was bored and decided to work out just how much slower the regex would be. Having now benchmarked 8 variations, I have to say "I was wrong".

On a 6MB test file, the regex ran in 2.48 secs while my split ran in 6.19 secs. Here are the variations I tried and the timings...

1) My version of the split: *6.19s* @columns = split /\"\s+\"/


2) Bill's most recent version: *9.34s* @columns = split /"\s+"|^"|"$/


3) Ji-Haw's original regex: *2.48s* /^"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"/


4) Ji-Haw's with a stored pattern to make it vaguely more readable: *3.2s*
my $field = '"([^"]+)"';
/^$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field\s+$field/;


5) With //o: this made no difference on the 6MB file but ran about 30% faster on a 270MB test file. I'm not sure why, given a previous list discussion.


6) A global pattern match which was more readable:  *4.95s*
my $field = '"([^"]+)"';
my($pink,$blue) = (/$field/g)[5,7];


7) To try to increase the speed of the matches, I decided to weed out lines not containing pink or blue: *1.2s*
next unless /blue/;
next unless /pink/;
/^"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"\s+"([^"]+)"/o;


8) Encouraged by that effort I tried the previous in a single match: *1.11s*
/^"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"(pink)"\s+"[^"]+"\s+"(blue)"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"/
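
For anyone who wants to repeat the comparison, the following is a rough sketch using the core Benchmark module on three of the variations (1, 6 and 8). It is not the harness used for the timings above: the in-memory data, the sub names and the iteration count are all assumptions, and the numbers it prints will differ from the file-based figures.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical data: 2000 lines, 1 in 100 with pink/blue in fields 6 and 8.
my $hit   = '"' . join('" "', qw(a b c d e pink g blue i j k l)) . '"';
my $miss  = '"' . join('" "', qw(a b c d e red g green i j k l)) . '"';
my @lines = map { $_ % 100 ? $miss : $hit } 1 .. 2000;

my $field = '"([^"]+)"';    # capturing field, as in variations 4 and 6
my $any   = '"[^"]+"';      # non-capturing field, as in variation 8
my $v8    = join '\s+', ($any) x 5, '"pink"', $any, '"blue"', ($any) x 4;
my $v8_re = qr/^$v8/;

# Variation 1: split on the separators between fields.
sub count_split {
    my $n = 0;
    for (@lines) {
        my @columns = split /"\s+"/;
        $n++ if $columns[5] eq 'pink' and $columns[7] eq 'blue';
    }
    return $n;
}

# Variation 6: global match, then slice out the sixth and eighth captures.
sub count_slice {
    my $n = 0;
    for (@lines) {
        my ($pink, $blue) = (/$field/g)[5, 7];
        $n++ if $pink eq 'pink' and $blue eq 'blue';
    }
    return $n;
}

# Variation 8: one anchored match with the literals in place.
sub count_match {
    my $n = 0;
    for (@lines) {
        $n++ if /$v8_re/;
    }
    return $n;
}

# Run each sub 50 times and print a comparison table.
cmpthese(50, {
    split => \&count_split,
    slice => \&count_slice,
    match => \&count_match,
});
```

All three subs return the same count, so the table compares equivalent work. Timing in-memory lines leaves out file I/O, which is one reason to expect different ratios from the ones quoted here.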

Conclusions:
A) I have too much time on my hands. The benchmarks on a 270MB file were taking 5 mins per pass. The 6MB performance was almost always proportional.
B) regex appears to be significantly faster than split.
C) I believe the split is the most readable, and for trivial to medium tasks I'd still use it so the next person can see what I've done.
D) I like version 6, the global regex, because it's short, clean and a fair compromise on performance.
E) If I were running through a 10GB file, I'd be running some variation on v8 and using fast filtering to eliminate the bulk of the file.
F) Bear in mind that my test file only had blue|pink on 1% of its entries, so your mileage may vary.
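
The fast-filtering idea from version 7 can be sketched like this. The sub names matching_lines and quoted, and the sample lines, are made up for illustration. The full pattern is built from the stored field string as in version 4.

```perl
use strict;
use warnings;

my $field   = '"([^"]+)"';
my $pattern = join '\s+', ($field) x 12;    # twelve captured fields
my $row     = qr/^$pattern/;

# Return the lines whose sixth field is pink and whose eighth field is blue.
sub matching_lines {
    my @found;
    for (@_) {
        # Cheap substring tests first: most lines stop here
        # and never reach the twelve-field regex.
        next unless /blue/;
        next unless /pink/;
        my @f = /$row/ or next;    # skip lines that do not have twelve fields
        push @found, $_ if $f[5] eq 'pink' and $f[7] eq 'blue';
    }
    return @found;
}

# Hypothetical helper and sample lines.
sub quoted { return '"' . join('" "', @_) . '"' }
my @sample = (
    quoted(qw(a b c d e pink g blue i j k l)),    # match
    quoted(qw(a b c d e red g green i j k l)),    # rejected by the filter
    quoted(qw(pink blue c d e f g h i j k l)),    # passes the filter, wrong columns
);
my @found = matching_lines(@sample);
print "$_\n" for @found;
```

How much the filter helps depends on how rare the matching lines are. With pink and blue on most lines, the two extra tests would just add work.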


while (<FILE>) {    # read one line at a time rather than loading the whole file
    if (/^"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"pink"\s+"[^"]+"\s+"blue"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"\s+"[^"]+"/) {
        # do stuff here...
    }
}
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
