Summary: The easy changes made it about 1% faster. The idea of using slices
is appealing, and might work, but I could not find the right syntax to make
it work.
Details: Thanks for everyone's advice. I made many of the suggested changes,
e.g., declared my @data outside the loop, used $. rather than my $numlines,
and made two changes that could have materially improved performance.
First, I hoisted the regex compilation out of the inner loop:
my $split_pat=','; # CHANGE ME IF NEEDED (later use getopts)
my $re = qr/$split_pat/o;
...
@data = split($re, $_);
and also, as Bill suggested, changed from:
for (my $i = 0; $i < @data; $i++) { $aoh[$i]{$data[$i]}++; } # original
to:
for (0..$#data) { $aoh[$_]{$data[$_]}++; }
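Putting those two changes together, the hot path now looks roughly like this
(a minimal self-contained sketch; the sample data and the per-field report at
the end are mine, not from the original program):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $split_pat = ',';            # CHANGE ME IF NEEDED (later use getopts)
my $re        = qr/$split_pat/; # compile the pattern once, outside the loop

my @aoh;     # $aoh[$i]{$value} = count of $value seen in field $i
my @data;    # declared outside the loop so it is not re-allocated per line

while (<DATA>) {
    chomp;
    @data = split $re, $_;
    for (0 .. $#data) { $aoh[$_]{ $data[$_] }++ }
}

# Report distinct values per field ($. holds the line count if needed):
print scalar( keys %{ $aoh[$_] } ), " distinct values in field $_\n"
    for 0 .. $#aoh;

__DATA__
a,x,1
a,y,1
b,x,2
```

For this sample each field has 2 distinct values, so the report prints 2 on
each of the three lines.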
Taken together, all these changes made it run a little more than 1% faster
on one test file on my PC that takes about 7 seconds to process. (The real
files can take > 1 minute, and I do not have access to that machine now.)
I measured by adding
use Time::HiRes qw( gettimeofday tv_interval );
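For reference, the wall-clock measurement with Time::HiRes looks like this
(a minimal sketch; the work being timed is a placeholder):

```perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

my $t0 = [gettimeofday];        # arrayref of [seconds, microseconds]

# ... the work being measured goes here ...

my $elapsed = tv_interval($t0); # floating-point seconds since $t0
printf "elapsed: %.3f s\n", $elapsed;
```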
In the original email I said
# For each field we increment the count for the data value in that field. We
# currently do not actually use those counts, so we could have assigned
# anything on the RHS, e.g. =1. But the counts might come in handy later.
Since I can just set a constant value, I wanted to try Bill's idea (see
below) of using slices, something like this:
$aoh[0..$#data]{@data} = 1;
But this did not work. Something like this ought to be possible, since the
number of hashes will always equal the number of data elements, but I cannot
get the sigils right.
Is there a way to do the above?
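For what it's worth, the core distinction can be shown in a few lines: a hash
slice assigns many keys of a single hash in one statement, but @aoh holds a
separate hash per field, and one slice cannot reach into several different
hashes at once, so some loop over the indices remains (a sketch; %seen is a
made-up name for illustration):

```perl
use strict;
use warnings;

my @data = qw(alpha beta gamma);

# Within ONE hash, a slice sets all the keys in a single statement:
my %seen;
@seen{@data} = (1) x @data;   # %seen = (alpha => 1, beta => 1, gamma => 1)

# @aoh is an ARRAY of hashes, one hash per field; a single slice cannot
# span several distinct hashes, so the index loop stays:
my @aoh;
$aoh[$_]{ $data[$_] } = 1 for 0 .. $#data;
```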
--
Thanks,
Steve Tolkin
From: Bill Ricker [mailto:[email protected]]
Sent: Saturday, March 08, 2014 2:30 PM
To: Steve Tolkin
Cc: Boston Perl Mongers
Subject: Re: [Boston.pm] perl program to count distinct values - can it be
made faster
[much snipped]
4A. Normally I'd use this
Heuristic: convert the hash-array update loop to an array/hash slice [
www.stonehenge.com/merlyn/UnixReview/col68.html ]
We'd like the 0 .. $NF range to replace $i and @data to replace $data[$i]
for a parallel operation.
(Again, the implicit code is tight C, not interpreted.)
But I don't immediately see a way to make this loop
for (my $i = 0; $i < scalar @data; $i++) { $aoh[$i]{$data[$i]}++; }
fully implicit with slices, the way I would for typical CSV unpacking like
@HofFields{@AofFieldNames} = @splitdata; # hash slice
since auto-increment is intrinsically scalar-oriented. AFAIK I can't ++ a
slice.
4B. We can at least save inner-loop re-indexing time.
Heuristic: whenever the C/Fortran-style for(;;) loop can be replaced with
foreach() or while(), do so
(unless it is replaceable by a slice (4A above), map, or another parallel op).
So
$aoh[$_]{$data[$_]}++ foreach 0 .. $#data;
which will be faster, since interpreted $i twiddling is replaced by implicit
loop control.
foreach (0..$#blah) { } should be equally fast as the statement-modifier form.
Confusingly, foreach() may be abbreviated for().
4C. While in general, for readability and safety,
foreach my $niceScopedName (..){block}
is preferred to (re)using the implicit $_
-- although that is safe even under while(<>) if you don't refer to both at
once! --
it is plausible that it might be slower than pushing a $_ context, so I
showed the $_ form in 4B.
You could bench it, or just follow the heuristic for raw speed:
always prefer the built-in; it's more likely optimized.
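To bench it, the core Benchmark module's cmpthese() compares the two loop
styles head to head (a minimal sketch; the sample data and iteration count
are arbitrary choices of mine):

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my @data = ('a') x 50;   # dummy row: 50 identical fields

cmpthese( 10_000, {      # run each variant 10,000 times, print a rate table
    c_style => sub {
        my @aoh;
        for ( my $i = 0; $i < @data; $i++ ) { $aoh[$i]{ $data[$i] }++ }
    },
    foreach => sub {
        my @aoh;
        $aoh[$_]{ $data[$_] }++ for 0 .. $#data;
    },
} );
```

Run it on the target system, of course; relative rates can differ by perl
version and hardware.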
4D. Using map{} (or worse, grep{}) as a synonym for foreach, just for the
side effects, would be evil:
map { $aoh[$_]{$data[$_]}++ } ( 0 .. $#data );
and it shouldn't be any faster than the foreach().
There's no benefit to being cryptic here, so don't.
But you /could/ bench it -- on the target system, of course. But it's evil,
and might get a "map used in void context" warning.
[more snipped]
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm