I wrote a simple perl program to count the number of distinct values in each
field of a file. It just reads each line once, sequentially. The input
files vary in the number of rows: 1 million is typical, but some have as
many as 100 million rows, and are 10 GB in size, so I am reluctant to use
slurp to read it all at once. It processes about 100,000 rows/second both
on my PC and on the AIX server, which is the real target machine. To my
surprise it is barely faster than a shell script that reads the entire file
multiple times, once per field, even for a file with 5 fields. The shell
script calls a pipeline like this for each field: awk to extract 1 field
value | sort -u | wc -l
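Concretely, with a comma-separated file and (say) field 2, the per-field
pipeline is roughly the following (the filename and field number here are
just placeholders):

```shell
# Sample two-field, comma-separated data.
printf 'a,x\nb,x\na,y\n' > data.csv
# Count distinct values in field 2: -F, sets the separator, $2 picks the field,
# sort -u keeps one copy of each value, wc -l counts them.
awk -F, '{print $2}' data.csv | sort -u | wc -l
```
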
Here is nearly all of the perl program count_distinct.pl -- the important
code is just after the word "autovivification":
# For each field we increment the count for the data value in that field. We
# currently do not actually use those counts, so we could have just said
# anything on the RHS, e.g. =1. But the counts might come in handy later,
# e.g., if we want to see the most frequent value.
use strict;
# return code: 0 == success; 1 == some warnings; 2 == some errors
my $rc = 0;
my $split_char=','; # CHANGE ME IF NEEDED (later use getopts)
my @aoh; # array of hashes: key is the field value; value is the count
my $numlines = 0;
my $firstline;
while (<>)
{
    chomp;
    $firstline = $_ if $numlines == 0;
    $numlines++;
    my @data;
    # Seemingly extra code below is for compatibility with older perl versions,
    # but this might not be needed anymore.
    if ( $split_char ne '' )
        { @data = split(/$split_char/,$_); }
    else
        { @data = split; }
    # Populate array of hashes. (This uses the perl "autovivification" feature.)
    for (my $i = 0; $i < scalar @data; $i++) {
        $aoh[$i]{$data[$i]}++;
    }
}
# print output
print "filename: $ARGV\n"; # This writes a "-" if data is piped in
if ($numlines > 0) {
    print "firstline: $firstline\n";
}
print "numlines: $numlines\n";
for (my $i = 0; $i < scalar @aoh; $i++) {
    # The number of keys in a hash is the "count distinct" number
    print "field#: ", $i, ", distinct: ", scalar keys %{$aoh[$i]}, "\n";
}
exit $rc;
P.S. Suggestions to improve the style are also welcome. I have not written
perl for years. I deliberately do not want to use features available only in
"modern" perl, or rely on any modules that must be installed. Or, maybe
someone has a different and faster program that does count distinct on the
fields in a file.
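For comparison, the same single-pass idea in awk might look like the sketch
below (field numbers here are 1-based, unlike the perl program, and it has
only been tried on toy input; memory use grows with the number of distinct
values, just as in the perl version):

```shell
# Sample two-field, comma-separated data.
printf 'a,x\nb,x\na,y\n' > data.csv
# Single-pass count-distinct per field: seen[i, value] plays the role of the
# perl array of hashes; distinct[i] is bumped only the first time a value
# appears in field i.
awk -F, '{
    for (i = 1; i <= NF; i++)
        if (!seen[i, $i]++) distinct[i]++
    if (NF > maxnf) maxnf = NF
} END {
    for (i = 1; i <= maxnf; i++)
        printf "field#: %d, distinct: %d\n", i, distinct[i]
}' data.csv
```

This avoids rereading the file once per field, at the cost of holding every
distinct (field, value) pair in memory at once.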
--
thanks,
Steve Tolkin
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm