Re: [Boston.pm] perl program to count distinct values - can it be made faster

Charles Reitzel Sat, 08 Mar 2014 11:11:32 -0800

I think Gyepi SAM is getting close to the issue: regex. It is clearly NOT I/O bound if scanning once per field is about as fast. Regexes are slow. Very slow. Replace w/ simple parsing logic based on string position and you'll get the speedup.

Note, I like regex just fine, but I have found them to be the bottleneck more than twice and have learned to avoid them for simple patterns inside a high-frequency loop.
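
For illustration, here is a minimal sketch of position-based parsing with index and substr. The helper name split_by_index is hypothetical (it does not appear in the thread), and the sketch assumes a non-empty literal delimiter. Whether it actually beats split would need to be measured on the real input.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a line on a literal
# delimiter using index/substr instead of a regex.
sub split_by_index {
    my ($line, $delim) = @_;
    die "empty delimiter\n" unless length $delim;   # assumption: non-empty
    my @fields;
    my $start = 0;
    # index returns -1 when the delimiter is not found at or after $start
    while ((my $pos = index($line, $delim, $start)) >= 0) {
        push @fields, substr($line, $start, $pos - $start);
        $start = $pos + length $delim;
    }
    push @fields, substr($line, $start);   # remainder after the last delimiter
    return @fields;
}

my @fields = split_by_index('a,b,,d', ',');
print join('|', @fields), "\n";   # prints a|b||d
```

Unlike split, this sketch keeps trailing empty fields, so the per-field counts can differ on lines that end with delimiters.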


Quoting Gyepi SAM <[email protected]>:

On Sat, Mar 08, 2014 at 10:59:22AM -0500, Steve Tolkin wrote:

# return code: 0 == success; 1 == some warnings; 2 == some errors
my $rc = 0;


This value never changes. I assume the larger program could change it.

my $split_char=',';  # CHANGE ME IF NEEDED (later use getopts)

my @aoh; # array of hashes: key is the field value; value is the count
my $numlines = 0;
my $firstline;
while (<>)
{
    chomp;
    $firstline = $_ if $numlines == 0;


More idiomatically written as:

     $firstline ||= $_;

    $numlines++;
    my @data;

    # Seemingly extra code below is for compatibility with older perl versions
    # But this might not be needed anymore.
    if ( $split_char ne '' )
    { @data = split(/$split_char/,$_); }
    else
    { @data = split; }

If you're really looking for optimizations, this test can be moved outside
the loop. Since no arg split splits on spaces, your check can be something
like:
        $split_char = ' ' unless $split_char;

Then the test and two split options can be replaced with

        @data = split /$split_char/, $_;

There may also be some benefit to hoisting the regular expression outside the loop:


    my $re = qr/$split_char/o;
    ...
    @data = split $re, $_;

If there's any, it will be tiny, but may be appreciable given your input size.
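
One caveat worth a sketch: a single space inside a regex is not the same as split's special ' ' argument. The regex matches exactly one space, while ' ' skips leading whitespace and splits on runs of whitespace, which is what a no-argument split does. The variable names below are illustrative, and the use of \Q...\E assumes the delimiter is meant literally, which the original program does not state.

```perl
use strict;
use warnings;

my $line = "a  b c";    # two spaces between 'a' and 'b'

# Special case: the string ' ' splits on runs of whitespace.
my @awk_mode = split ' ', $line;

# A space interpolated into a regex matches exactly one space,
# so the doubled space yields an empty field.
my $sep   = ' ';
my @regex = split /$sep/, $line;

print scalar(@awk_mode), "\n";   # prints 3
print scalar(@regex), "\n";      # prints 4

# Hoisted pattern for a literal delimiter, compiled once before the loop.
# /o is omitted: a qr// evaluated outside the loop is compiled only once anyway.
my $split_char = ',';
my $re = qr/\Q$split_char\E/;
my @data = split $re, 'x,y,z';
print scalar(@data), "\n";       # prints 3
```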

    # Populate array of hashes.  (This uses perl "autovivification" feature.)
    for (my $i = 0; $i < scalar @data; $i++) {
        $aoh[$i]{$data[$i]}++;
    }
}


Nit: "scalar @data" can be replaced with @data.
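
A small sketch of that nit, with made-up sample data: in a numeric comparison the array is already evaluated in scalar context, so the explicit scalar is redundant. The second loop shows an equivalent range over the last index.

```perl
use strict;
use warnings;

my @data = ('x', 'y', 'x');   # sample fields, for illustration only
my (@aoh, @aoh2);

# $i < @data compares against the element count; no "scalar" needed.
for (my $i = 0; $i < @data; $i++) {
    $aoh[$i]{ $data[$i] }++;
}

# Equivalent loop over the indices 0 .. last index.
for my $i (0 .. $#data) {
    $aoh2[$i]{ $data[$i] }++;
}

print scalar(@aoh), " ", scalar(@aoh2), "\n";   # prints 3 3
```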

# print output
print "filename: $ARGV\n"; # This writes a "-" if data is piped in
if ($numlines >0) {
    print "firstline: $firstline\n";
}
print "numlines: $numlines\n";
for (my $i = 0; $i < scalar @aoh; $i++) {
    # The number of keys in a hash is the "count distinct" number
    print "field#: ", $i, ", distinct: ", scalar keys %{$aoh[$i]}, "\n";

Nit: This reads better as a printf, I think.

    printf "field#: %d, distinct: %d\n", $i, scalar keys %{$aoh[$i]};

}

exit $rc;


This is always 0, as noted above.

My initial thought at improvement was to avoid the split and walk through each
line looking for a $split_char or a "\n", but that just duplicates split in
perl instead of C. I think you've got just about the fastest program in perl.

For fun, I wrote a version in Go and it's twice as fast as the perl
version. I imagine a C version would be faster yet, but I get paid for that
kind of fun. I'd be happy to send you the Go version if you're interested.

-Gyepi

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm




