Summary: The easy changes made it about 1% faster.  The idea of using slices
is appealing, and might work, but I did not find the right syntax to get it
to work.

 

Details: Thanks to everyone's advice.  I made many of the suggested changes,
e.g., declared my @data outside the loop, used $. rather than my $numlines,
and made two changes that could have materially improved performance.  I
moved the regex out of the inner loop by writing

  my $split_pat=',';  # CHANGE ME IF NEEDED (later use getopts)

  my $re = qr/$split_pat/o;

  ...

  @data = split($re, $_);

 

and also, as Bill suggested, changed from:

   for (my $i = 0; $i < @data; $i++) { $aoh[$i]{$data[$i]}++; } # original

to:

   for (0..$#data) { $aoh[$_]{$data[$_]}++; } 
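Putting those two changes together, a minimal self-contained sketch of the
counting loop might look like this (the sample records are hypothetical
stand-ins for the real input file):

```perl
use strict;
use warnings;

my $split_pat = ',';             # CHANGE ME IF NEEDED (later use getopts)
my $re        = qr/$split_pat/;  # compiled once, outside the loop

my @aoh;    # $aoh[$i]{$value} = count of $value seen in field $i
my @data;   # declared outside the loop, per the earlier advice

# Hypothetical sample records standing in for the real input file.
for my $line ("a,b,c", "a,b,d", "x,b,c") {
    @data = split($re, $line);
    $aoh[$_]{ $data[$_] }++ for 0 .. $#data;
}
# $aoh[0] now holds { a => 2, x => 1 }, and so on for each field.
```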

 

Taken together, all these changes made it run a little more than 1% faster
on one test file on my PC that takes about 7 seconds to process.  (The real
files can take > 1 minute, and I do not have access to that machine now.)  I
measured by adding

  use Time::HiRes qw( gettimeofday tv_interval );
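For instance, the timing harness looks roughly like this (the summed loop is
a throwaway stand-in for the real work):

```perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

my $t0 = [gettimeofday];          # wall-clock start: [seconds, microseconds]

# ... the work being timed goes here; a throwaway sum as a stand-in ...
my $sum = 0;
$sum += $_ for 1 .. 100_000;

my $elapsed = tv_interval($t0);   # elapsed seconds as a floating-point value
printf "took %.4f seconds\n", $elapsed;
```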

 

In the original email I said

# For each field we increment the count for the data value in that field. We
# currently do not actually use those counts, so we could have just said
# anything on the RHS, e.g. =1.  But the counts might come in handy later.
 

Since I can just set a constant value, I wanted to try Bill's idea (see
below) of using slices, something like this:

  $aoh[0..$#data]{@data} = 1;  

But this did not work.  Something like it ought to work, since the number
of hashes will always equal the number of data elements, but I cannot get
the sigils right.

 

Is there a way to do the above?
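For what it's worth, a sketch of why the one-liner cannot work as written,
plus an element-wise form with the constant RHS that does parse (it is not a
true cross-hash slice):

```perl
use strict;
use warnings;

my @data = qw(a b c);
my @aoh;

# In $aoh[0..$#data], the subscript imposes scalar context, where ".."
# is the flip-flop operator rather than a list of indices -- so no
# array slice results, and the one-liner cannot mean what was intended.

# An element-wise form with the constant RHS mentioned in the comment:
$aoh[$_]{ $data[$_] } = 1 for 0 .. $#data;
# Each field's hash now has its value as a key, e.g. $aoh[1]{b} == 1.
```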

 

 

-- 

Thanks,

Steve Tolkin

 

From: Bill Ricker [mailto:[email protected]] 
Sent: Saturday, March 08, 2014 2:30 PM
To: Steve Tolkin
Cc: Boston Perl Mongers
Subject: Re: [Boston.pm] perl program to count distinct values - can it be
made faster

 

[much snipped]

4A. Normally i'd use

Heuristic: convert the hash-array update loop to array/hash slice [
www.stonehenge.com/merlyn/UnixReview/col68.html  ]

We'd like the 0 .. $NF range to replace $i, and @data to replace $data[$i],
for parallel operation.

(Again, implicit code is tight C, not interpreted.)

But i don't immediately see a way to make this loop

   for (my $i = 0; $i < scalar @data; $i++) {  $aoh[$i]{$data[$i]}++; }

fully implicit with slices the way i'd do for typical CSV unpacking like

  @HofFields{@AofFieldNames} = @splitdata; # hash slice

since auto-increment is intrinsically scalar-oriented. AFAIK i can't ++ a
slice.
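A runnable sketch of that hash-slice unpacking pattern (the field names and
record here are made up):

```perl
use strict;
use warnings;

my @AofFieldNames = qw(name city state);            # hypothetical header
my @splitdata     = split /,/, "Alice,Boston,MA";   # hypothetical record

my %HofFields;
@HofFields{@AofFieldNames} = @splitdata;   # hash slice: parallel assignment
# $HofFields{city} is now "Boston"
```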

 

4B. We can at least save inner loop re-indexing time.

 

Heuristic: Whenever the C/Fortran for(;;) loop can be replaced with
foreach() or while(), do so.

( unless replaceable by Slice (4A above), map, or other parallel ops.)

 

So

     $aoh[$_]{$data[$_]}++ foreach 0 .. $#data; 

which will be faster since interpreted $i  twiddling is replaced by implicit
loop control.

 

foreach (0..$#blah) { } should be as fast as the statement-modifier form.

Confusingly, foreach() may be abbreviated for().

 

4C. While in general, for readability and safety,

    foreach my $niceScopedName (..){block} 

is preferred to (re)using the implicit $_ 

-- although $_ is safe even under while(<>) if you don't refer to both at
once! --  

it is plausible that it might be slower than pushing a $_ context, so i
showed the $_ form in 4B.

 

You could bench it, or just follow the heuristic for raw speed:

Always prefer the built-in, it's more likely optimized.
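A sketch of such a bench with the core Benchmark module (the 20-field record
is synthetic):

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @data = ('a') x 20;   # synthetic record with 20 identical fields
my @aoh;

# Compare the C-style loop against the implicit foreach form; each sub
# runs 10_000 times and cmpthese prints a rate-comparison table.
cmpthese( 10_000, {
    c_style => sub {
        for ( my $i = 0; $i < @data; $i++ ) { $aoh[$i]{ $data[$i] }++ }
    },
    foreach_idx => sub {
        $aoh[$_]{ $data[$_] }++ for 0 .. $#data;
    },
} );
```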

 

4D. Using map{} or, worse, grep{} as a synonym for foreach, just for the
side effects, would be evil:

   map { $aoh[$_]{$data[$_]}++ }  ( 0 .. $#data ) ; 

and shouldn't be any faster than the foreach(). 

No benefit to being cryptic here, so don't.

But you /could/ bench it -- on the target system, of course.  But it's evil,
and might get a "map used in void context" warning.

 

[more snipped]

 


_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
