On Sat, Mar 08, 2014 at 10:59:22AM -0500, Steve Tolkin wrote:
> # return code: 0 == success; 1 == some warnings; 2 == some errors
> my $rc = 0;
This value never changes. I assume the larger program could change it.
> my $split_char=','; # CHANGE ME IF NEEDED (later use getopts)
>
> my @aoh; # array of hashes: key is the field value; value is the count
> my $numlines = 0;
> my $firstline;
> while (<>)
> {
> chomp;
> $firstline = $_ if $numlines == 0;
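More idiomatically written as follows (//= needs perl 5.10 or later; a plain
||= would overwrite a first line of "0" or "" with a later one):
$firstline //= $_;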
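One thing to watch with this idiom: ||= tests truth while //= tests definedness, so the two differ when the first line chomps to "0" or "". A minimal sketch of the difference, assuming perl 5.10 or later (both helper names are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helpers: return the "first line" each idiom would remember.
sub first_with_or {
    my $first;
    $first ||= $_ for @_;    # reassigns while $first is false ("0", "", undef)
    return $first;
}

sub first_with_defined_or {
    my $first;
    $first //= $_ for @_;    # assigns only while $first is undefined
    return $first;
}

my @lines = ("0", "a,b", "c,d");
print "||= keeps: ", first_with_or(@lines), "\n";
print "//= keeps: ", first_with_defined_or(@lines), "\n";
```

With the sample input, the ||= variant ends up holding "a,b", while the //= variant keeps the real first line, "0".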
> $numlines++;
> my @data;
>
> # Seemingly extra code below is for compatibility with older perl versions
> # But this might not be needed anymore.
> if ( $split_char ne '' )
> { @data = split(/$split_char/,$_); }
> else
> { @data = split; }
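If you're really looking for optimizations, this test can be moved outside
the loop. Since split with no arguments splits on whitespace, which is the
same as splitting on the string ' ', your check can be something like:
$split_char = ' ' if $split_char eq '';
Then the test and two split options can be replaced with
@data = split $split_char, $_;
Note that this passes the string itself rather than /$split_char/: the regex
/ / splits on every single space, and only the plain string ' ' gets the
whitespace behavior (passing it through a variable needs perl 5.18 or later).
For a real delimiter there may also be some benefit to compiling the regular
expression once, outside the loop (no /o needed; qr// compiles it when it runs):
my $re = qr/$split_char/;
...
@data = split $re, $_;
Any gain will be tiny per line, but may add up given your input size.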
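To make the hoisting concrete, here is a rough sketch that picks the split pattern once, before the loop. It assumes an empty $split_char should mean whitespace splitting, as in the original "split;" branch. The \Q...\E quoting is my own assumption, so that a delimiter such as | is taken literally; drop it if you want to pass real regexes. split_pattern is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the split pattern once, before the loop.
sub split_pattern {
    my ($split_char) = @_;

    # The plain string ' ' (not the regex / /) selects awk-style splitting.
    # Passing it through a variable relies on perl 5.18 or later.
    return ' ' if $split_char eq '';

    # Assumption: treat the delimiter as literal text, compiled a single time.
    return qr/\Q$split_char\E/;
}

my $pattern = split_pattern(',');
for my $line ("a,b,c", "x,y") {
    my @data = split $pattern, $line;
    print scalar(@data), " fields: @data\n";
}
```

The loop body is then a single split with no per-line branching.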
> # Populate array of hashes. (This uses perl "autovivification" feature.)
> for (my $i = 0; $i < scalar @data; $i++) {
> $aoh[$i]{$data[$i]}++;
> }
> }
Nit: "scalar @data" can be replaced with @data, since the < comparison
already puts the array in scalar context.
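Going one step further, a range over the indices avoids the explicit counter altogether. A small sketch, with invented sample rows and a made-up count_columns helper, just to show the shape:

```perl
use strict;
use warnings;

# Hypothetical helper: build one hash per column (field value => count).
sub count_columns {
    my @aoh;
    for my $line (@_) {
        my @data = split /,/, $line;
        # $#data is the last index; autovivification creates each hash on demand.
        $aoh[$_]{ $data[$_] }++ for 0 .. $#data;
    }
    return @aoh;
}

my @aoh = count_columns("a,b", "a,c", "d,c");
for my $i (0 .. $#aoh) {
    printf "field#: %d, distinct: %d\n", $i, scalar keys %{ $aoh[$i] };
}
```

I would not expect this to be measurably faster; it is only a readability change.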
> # print output
> print "filename: $ARGV\n"; # This writes a "-" if data is piped in
> if ($numlines >0) {
> print "firstline: $firstline\n";
> }
> print "numlines: $numlines\n";
> for (my $i = 0; $i < scalar @aoh; $i++) {
> # The number of keys in a hash is the "count distinct" number
> print "field#: ", $i, ", distinct: ", scalar keys %{$aoh[$i]}, "\n";
Nit: This reads better as a printf, I think.
printf "field#: %d, distinct: %d\n", $i, scalar keys %{$aoh[$i]};
> }
>
> exit $rc;
This is always 0, as noted above.
My first thought for an improvement was to avoid the split and walk through
each line looking for $split_char or a "\n", but that just reimplements split
in Perl instead of C. I think you've got just about the fastest program in Perl.
For fun, I wrote a version in Go and it's twice as fast as the perl
version. I imagine a C version would be faster yet, but I get paid for that
kind of fun. I'd be happy to send you the Go version if you're interested.
-Gyepi