I wrote a simple Perl program to count the number of distinct values in each
field of a file.  It reads each line just once, sequentially.  The input
files vary in the number of rows: 1 million is typical, but some have as
many as 100 million rows and are 10 GB in size, so I am reluctant to slurp
a whole file into memory.  The program processes about 100,000 rows/second,
both on my PC and on the AIX server that is the real target machine.  To my
surprise, it is barely faster than a shell script that rereads the entire
file once per field, even for a file with 5 fields.  The shell script runs
a pipeline like this once per field, using awk to extract one field's
values, e.g. for field 2 of a comma-separated file:

    awk -F, '{print $2}' file | sort -u | wc -l

Here is nearly all of the Perl program count_distinct.pl -- the important
code is just after the word "autovivification":

# For each field we increment the count for the data value in that field.
# We currently do not actually use those counts, so we could have just
# said anything on the RHS, e.g. =1.  But the counts might come in handy
# later, e.g., if we want to see the most frequent value.

use strict;

# return code: 0 == success; 1 == some warnings; 2 == some errors
my $rc = 0;
my $split_char = ',';  # CHANGE ME IF NEEDED (later use getopts)

my @aoh; # array of hashes: key is the field value; value is the count
my $numlines = 0;
my $firstline;
while (<>)
{
    chomp;
    $firstline = $_ if $numlines == 0; 
    $numlines++;
    my @data;
    
    # The seemingly extra code below is for compatibility with older perl
    # versions, but it might not be needed anymore.
    if ( $split_char ne '' )
    { @data = split(/$split_char/, $_, -1); }  # -1 keeps trailing empty fields
    else
    { @data = split; }
    
    # Populate the array of hashes.  (This uses perl's "autovivification"
    # feature.)
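    # E.g., $aoh[2]{"foo"}++ creates the anonymous hash %{$aoh[2]} on first
    # use, so no explicit initialization is needed.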
    for my $i (0 .. $#data) {
        $aoh[$i]{$data[$i]}++;
    }
}
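
# Memory note: the hashes above grow with the number of distinct values in
# each field, not with the row count, so even a 100 million row file is
# manageable as long as each field has a limited number of distinct values.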

# print output
print "filename: $ARGV\n"; # This writes a "-" if data is piped in
if ($numlines > 0) {
    print "firstline: $firstline\n";
}
print "numlines: $numlines\n";
for my $i (0 .. $#aoh) {
    # The number of keys in a hash is the "count distinct" number
    print "field#: ", $i, ", distinct: ", scalar keys %{$aoh[$i]}, "\n";
}
exit $rc;
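
The counts would also let a later version report the most frequent value in
each field.  Here is an untested sketch, reusing the @aoh structure built
above:

for my $i (0 .. $#aoh) {
    my $href = $aoh[$i];
    # Sort this field's values by descending count; the first is the mode.
    my ($top) = sort { $href->{$b} <=> $href->{$a} } keys %$href;
    print "field#: $i, most frequent: $top (count: $href->{$top})\n"
        if defined $top;
}

(For a field with very many distinct values, a single pass that tracks a
running maximum would avoid the cost of the sort.)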

P.S. Suggestions to improve the style are also welcome.  I have not written
Perl for years.  I deliberately do not want to use features available only
in "modern" Perl, or to rely on any modules that must be installed.
Alternatively, maybe someone has a different, faster program that counts
the distinct values in each field of a file.
  
-- 
thanks,
Steve Tolkin