Re: [Boston.pm] perl program to count distinct values - can it be made faster

Gyepi SAM Sat, 08 Mar 2014 10:52:28 -0800

On Sat, Mar 08, 2014 at 10:59:22AM -0500, Steve Tolkin wrote:
> # return code: 0 == success; 1 == some warnings; 2 == some errors
> my $rc = 0;


This value never changes. I assume the larger program could change it.

> my $split_char=',';  # CHANGE ME IF NEEDED (later use getopts)
> 
> my @aoh; # array of hashes: key is the field value; value is the count
> my $numlines = 0;
> my $firstline;
> while (<>)
> {
>     chomp;
>     $firstline = $_ if $numlines == 0; 

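More idiomatically written, without consulting $numlines, as:

     $firstline //= $_;

This needs perl 5.10 or later. Avoid ||= here: it would skip a first line
that is empty or "0" and remember a later line instead.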
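
The choice of assignment operator matters when the first line is a false
value. A small sketch of the difference, assuming a made-up helper
first_line_seen that is not part of the original program:

```perl
use strict;
use warnings;

# Hypothetical helper: return the line that each assignment style would
# remember as the "first line" of the given (already chomped) lines.
sub first_line_seen {
    my ($op, @lines) = @_;
    my $firstline;
    for my $line (@lines) {
        if ($op eq '||=') { $firstline ||= $line }   # skips false values: '' and '0'
        else              { $firstline //= $line }   # skips only undef (perl 5.10+)
    }
    return $firstline;
}

my @lines = ('0', 'a,b', 'c,d');
print first_line_seen('||=', @lines), "\n";   # prints "a,b"
print first_line_seen('//=', @lines), "\n";   # prints "0"
```

Since chomp always leaves a defined value in $_, the defined-or form can only
ever capture the true first line.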

>     $numlines++;
>     my @data;
>     
>     # Seemingly extra code below is for compatibility with older perl versions
>     # But this might not be needed anymore.
>     if ( $split_char ne '' )
>     { @data = split(/$split_char/,$_); }
>     else
>     { @data = split; }
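
If you're really looking for optimizations, this test can be moved outside
the loop. A split with no arguments behaves like split ' ' (it splits on runs
of whitespace and skips leading whitespace), so your check can be something
like:

        $split_char = ' ' if $split_char eq '';

Then the test and two split options can be replaced with

        @data = split $split_char, $_;

Note that this passes the string itself rather than /$split_char/. The
pattern / / splits on every single space; only the string ' ' gets the
special whitespace behavior, and when that string comes from a variable it
needs perl 5.18 or later.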

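There may also be some benefit to hoisting the regular expression outside the
loop:

    my $re = qr/$split_char/;
    ...
    @data = split $re, $_;

No /o flag is needed, since qr// already compiles the pattern once. If the
separator could be a regex metacharacter such as | or ., quote it with
qr/\Q$split_char\E/. Keep the whitespace case out of qr//, because qr/ /
matches one literal space and loses the special behavior of split ' '.

If there's any gain, it will be tiny, but may be appreciable given your input
size.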
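
Putting the two suggestions together, a minimal sketch might look like the
following. The helper name count_distinct and the in-memory input are made up
for illustration; only the split logic and the array of hashes come from your
program, and the whitespace branch assumes perl 5.18 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): count distinct values
# per column for the lines read from an open filehandle.
sub count_distinct {
    my ($fh, $split_char) = @_;

    # Decide once, outside the loop, how to split.
    # ' ' (a string, not / /) selects split's awk-style whitespace mode;
    # passing it through a variable relies on perl 5.18 or later.
    # \Q..\E keeps separators such as '|' or '.' literal.
    my $sep = $split_char eq '' ? ' ' : qr/\Q$split_char\E/;

    my @aoh;   # array of hashes: key is the field value, value is the count
    while (my $line = <$fh>) {
        chomp $line;
        my @data = split $sep, $line;
        $aoh[$_]{ $data[$_] }++ for 0 .. $#data;
    }
    return map { scalar keys %$_ } @aoh;
}

# Usage with an in-memory file standing in for real input.
open my $fh, '<', \"a,1\nb,1\na,2\n" or die "cannot open in-memory file: $!";
my @distinct = count_distinct($fh, ',');
print "field#: $_, distinct: $distinct[$_]\n" for 0 .. $#distinct;
```

Whether this is measurably faster than the original would have to be timed on
the real input.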

>     # Populate array of hashes.  (This uses perl "autovivification" feature.)
>     for (my $i = 0; $i < scalar @data; $i++) {
>         $aoh[$i]{$data[$i]}++;
>     }
> }

Nit: "scalar @data" can be replaced with @data, since the numeric comparison
already puts the array in scalar context.

> # print output
> print "filename: $ARGV\n"; # This writes a "-" if data is piped in
> if ($numlines >0) {
>     print "firstline: $firstline\n";
> }
> print "numlines: $numlines\n";
> for (my $i = 0; $i < scalar @aoh; $i++) {
>     # The number of keys in a hash is the "count distinct" number
>     print "field#: ", $i, ", distinct: ", scalar keys %{$aoh[$i]}, "\n";

Nit: This reads better as a printf, I think. The scalar has to stay here,
because printf's arguments are evaluated in list context.

    printf "field#: %d, distinct: %d\n", $i, scalar keys %{$aoh[$i]};

> }
>
> exit $rc;

This is always 0, as noted above. 

My initial thought at improvement was to avoid the split and walk through each
line looking for a $split_char or a "\n", but that just duplicates split in
perl instead of C. I think you've got just about the fastest program in perl.
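
For the curious, the hand-rolled walk I had in mind would look roughly like
this sketch. manual_split is a made-up name, and the code assumes a chomped
line and a non-empty literal separator:

```perl
use strict;
use warnings;

# Hypothetical helper: split a chomped line on a literal separator by
# scanning with index() instead of calling split.
sub manual_split {
    my ($line, $split_char) = @_;
    die "separator must be non-empty\n" if $split_char eq '';

    my @fields;
    my $start = 0;
    while ((my $pos = index($line, $split_char, $start)) >= 0) {
        push @fields, substr($line, $start, $pos - $start);
        $start = $pos + length $split_char;
    }
    push @fields, substr($line, $start);   # last field, after the final separator
    return @fields;
}

my @fields = manual_split("a,b,,c", ",");
print join("|", @fields), "\n";   # prints "a|b||c"
```

Every step of that loop runs as perl ops, whereas split does the same scan in
C, which is why I would not expect it to win. One behavioral difference: this
sketch keeps trailing empty fields, while split drops them by default.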

For fun, I wrote a version in Go and it's twice as fast as the perl
version. I imagine a C version would be faster yet, but I get paid for that
kind of fun. I'd be happy to send you the Go version if you're interested.

-Gyepi

