Hi Steve,

You've got AIX again? Cool.
I like AIX. Let me know if you need help! (not just Perl)

>  Or, maybe someone has a different and faster program that does count
>  distinct on the fields in a file.

Outside the box: if the data is coming from or going to a SQL system, SQL
can do the COUNT DISTINCT for free as a side effect of indexing (e.g.
SELECT COUNT(DISTINCT field3) FROM mytable).

>  I deliberately do not want to use features only in "modern" perl, or
>  rely on any modules that must be installed.

If a script needs to be maximally portable, that's sadly true.
But modules that wrap C routines not built in can be the performance win.
Don't over-do this by writing Perl 4 or C-ish Perl, which is not going to be
as fast as the Perl 5.005 dialect; use implicit loops wherever possible. Code
you don't write is faster than code you do write, and has fewer bugs too.

The general mantra for speed is to use built-in features wherever possible,
and to use expressions and logic as powerfully as possible. Your code that
is interpreted (compiled into op-trees) is slower than the C code in the
Perl core; run in the Perl core as much as possible.

(This is sometimes in opposition to the mantras for readability and
cleanliness. If speed matters, and only when speed really matters, a little
tactical ugliness will be worth it.)

i. General Comments on Huge Files

> are 10 GB in size, so I am reluctant to use slurp to read it all at once.

If it's a 32-bit Perl, reluctance isn't the word; it would die, die, die:
2 GB limits abound.
If this AIX has a 64-bit perl (mine did recently), an account with a high
enough ulimit *could* slurp, quite possibly without swapping.

However, since slurping into an array creates per-string overhead for each
record, it may not be advisable with big data, even with a high ulimit and
big RAM.

But that's not the only choice. It's plausible that even with swapping,
slurping *into a single scalar* would be faster, since you'll step through
it in a predictable fashion. And on AIX it might not even swap, unless
you've got a scrawny AIX!
Loop with no copying of records (avoiding malloc), only shuffling index
scalars as quasi-pointers:
   my ($left, $right) = (0, 0);
   while ( ($right = index($buf, "\n", $left)) >= 0 ) {  # find next "\n" (not '\n'!)
       @data = split /,/, substr($buf, $left, $right - $left);  # one record
       $left = $right + 1;
   }
But you can't time it on a swapping PC to predict non-swapping server time.

You could even loop likewise on index(',') to substr out the fields, to
avoid malloc shuffling the fields too (sketch below).
That might be a big win if malloc is the main time sink, as it often is.
(C stuffing '\0' onto each ',' would be even faster, of course. But C
doesn't autovivify the tally hash.)
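
Here's a minimal sketch of that field walk, assuming $buf holds the file
and $left/$right bracket one record per the loop above; $p, $q, and the
field counter $f are my names, @aoh is your tally array-of-hashes:

   my ($p, $f) = ($left, 0);                       # character cursor, field number
   while ( $p <= $right ) {
       my $q = index($buf, ',', $p);
       $q = $right if $q < 0 or $q > $right;       # last field of this record
       $aoh[$f++]{ substr($buf, $p, $q - $p) }++;  # tally without a named copy
       $p = $q + 1;
   }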

More efficient might be mmap()ing the file, letting the paging mechanism do
the *only* read for you, but this requires a C wrapper module for POSIX
mmap(). (Again, this requires a 64-bit OS and a 64-bit perl.)
http://stackoverflow.com/questions/1052765/linux-perl-mmap-performance and
http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
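
From memory of the Sys::Mmap docs (a CPAN wrapper, so it conflicts with
your no-modules constraint), the shape would be roughly:

   use Sys::Mmap;                       # CPAN module wrapping mmap(2)
   my $buf;
   open my $fh, '<', $file or die "open: $!";
   mmap($buf, 0, PROT_READ, MAP_SHARED, $fh) or die "mmap: $!"; # 0 => whole file
   # ... walk $buf with the index()/substr() loops above ...
   munmap($buf);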


You might not need those options if the following suggestions get you fast
enough.


ii. Comments on details of code.

1. Use the built-in $. instead of $numlines.
It should be slightly faster, once per record.
(Implicit code is tight C, not interpreted.)

Yes, $. is available after
  while(<>){};
finishes.
It will even be right if more than one file was read, *unless*
    continue { close ARGV if eof; }  # Not eof()!
was used to reset $. to 1 at the top of each file (as is done for the
"$ARGV:$.: " prefix in warnings).
http://perldoc.perl.org/functions/eof.html
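
To be concrete, a minimal sketch with no hand-rolled counter at all:

   while (<>) {
       # ... per-record work ...
   }
   print STDERR "read $. records\n";   # $. survives the loop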

2. Pull
   my @data;
out of the loop, so there's no stack shuffling in the loop. Yes, it's
cleaner inside, but it's costing you some malloc.
You could replace it at the top of the loop with
   @data = ();   # don't
but only *if* re-initialization were needed, and it is not needed here,
since split never fails to return and is never bypassed by logic.
   @data = ();   # is only needed before
   @data = ($1, $2) if /(a+)(b+)/;
or other logic which might skip the assignment.
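
A minimal sketch of the hoisted form (the tally line anticipates 4B below):

   my (@data, @aoh);
   while (<>) {
       chomp;
       @data = split /,/;                          # reuses the same @data each pass
       $aoh[$_]{$data[$_]}++ foreach 0 .. $#data;  # see 4B
   }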

3. while(<>) is doing the buffered read you want for you, with the OS
prefetching the next buffer if you're lucky.
So switching to sysread() may not be of much advantage, and it leaves you to
parse out the records yourself.
You might gain some efficiency processing several records in mock-parallel
if you split a buffer into multiple records.
BUT it adds the complexity of skipping and retaining the partial last
record, to glue onto the 1st record of the next buffer after the split.
(For malloc efficiency, glue the remainder onto the first record of the
next buffer, NOT onto the next buffer before the split.)
May not be worth it!
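
If you do try it, a hedged sketch of that buffering dance ($file, the
buffer size, and the tally step are my assumptions):

   my ($buf, $tail, @data) = ('', '');
   open my $fh, '<', $file or die "open $file: $!";
   while ( sysread($fh, $buf, 1 << 20) ) {  # 1 MB buffers; size is arbitrary
       my @recs = split /\n/, $buf, -1;     # -1 keeps a trailing empty piece
       $recs[0] = $tail . $recs[0];         # glue remainder onto 1st rec, not buffer
       $tail = pop @recs;                   # partial last record, saved for next pass
       foreach my $rec (@recs) {
           @data = split /,/, $rec;
           # ... tally @data here ...
       }
   }
   # if the file had no final newline, $tail still holds the last record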

4. Inner-Loop Optimization: the auto-viv auto-increment Array of Hashes.
Heuristic: what's slightly slow in the inner loop is multiplied by the
biggest multiplier, so look here first.

4A. Normally I'd use this heuristic: convert the hash/array update loop to
an array/hash slice [ www.stonehenge.com/merlyn/UnixReview/col68.html ].
We'd like a 0 .. $NF range to replace $i, and @data to replace $data[$i],
for parallel operation.
(Again, implicit code is tight C, not interpreted.)
But I don't immediately see a way to make this loop
   for (my $i = 0; $i < scalar @data; $i++) { $aoh[$i]{$data[$i]}++; }
fully implicit with slices, the way I would for typical CSV unpacking like
   @HofFields{@AofFieldNames} = @splitdata; # hash slice
since auto-increment is intrinsically scalar-oriented. AFAIK I can't ++ a
slice.
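
To illustrate the slice idiom itself (the field names and data here are
made up):

   my @AofFieldNames = qw(name dept salary);
   my %HofFields;
   @HofFields{@AofFieldNames} = split /,/, 'ann,ops,100';  # one parallel assignment, all in C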

4B. We can at least save inner loop re-indexing time.

Heuristic: whenever the C/Fortran-style for(;;) loop can be replaced with
foreach() or while(), do so
(unless it's replaceable by a slice (4A above), map, or other parallel ops).

So
     $aoh[$_]{$data[$_]}++ foreach 0 .. $#data;
which will be faster, since interpreted $i twiddling is replaced by
implicit loop control.

foreach (0..$#blah) { } should be just as fast as the statement-modifier
form. Confusingly, foreach() may be abbreviated for().

4C. While in general, for readability and safety,
    foreach my $niceScopedName (...) { block }
is preferred to (re)using the implicit $_
(which is safe even under while(<>), if you don't refer to both at once!),
it is plausible that it might be slower than pushing a $_ context, so I
showed the $_ form in 4B.

You could bench it, or just follow the heuristic for raw speed:
always prefer the built-in; it's more likely optimized.
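
If you do bench it, the core Benchmark module (no install needed) will do.
A sketch, with fake data:

   use Benchmark qw(cmpthese);
   my @data = ('x') x 20;
   my (@aoh1, @aoh2);
   cmpthese(-3, {                      # run each for ~3 CPU seconds
       implicit => sub { $aoh1[$_]{$data[$_]}++ foreach 0 .. $#data },
       lexical  => sub { foreach my $i (0 .. $#data) { $aoh2[$i]{$data[$i]}++ } },
   });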

4D. Using map{} (or worse, grep{}) as a synonym for foreach, just for the
side effects, would be evil
   map { $aoh[$_]{$data[$_]}++ } ( 0 .. $#data );
and shouldn't be any faster than the foreach().
No benefit to being cryptic here, so don't.
But you /could/ bench it. On the target system, of course. But it's evil,
and might draw a "map used in void context" warning.

5. Comment out the old-perl workaround
   if ( $split_char ne '' )
unless/until you prove you have an old enough perl to cause trouble
*and* you need to implement getopt.
Checking it 1E6 to 1E8 times inside the loop isn't doing you any favors,
and it is useless as long as $split_char is hard-coded. Maybe the optimizer
notices that $split_char is a fixed ',', but don't count on it.
(Leave it in the file, commented out, just in case you (or the maintenance
coder) get '' later and it croaks, so you have the fix handy.)
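
Hoisting the test would look something like this ($split_char and the loop
shape are assumed from your script):

   my $split_char = ',';
   die "empty split character\n" if $split_char eq '';  # checked once, not 1E6-1E8 times
   my @data;
   while (<>) {
       chomp;
       @data = split /\Q$split_char\E/;
       # ... tally @data here ...
   }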

-- BILL
