Steve,

I think disk I/O is going to be your performance bottleneck, not CPU usage.
(Disk I/O is going to greatly complicate benchmarking, but you might want to
test how long it takes to cat the file to /dev/null.)
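
Another quick way to see how much of the runtime is pure I/O (just a sketch;
this mirrors your read loop but does nothing with each line) is to time
something like this against count_distinct.pl on the same file:

use strict;

# read_only.pl -- read every line and discard it.
# Comparing its runtime with count_distinct.pl on the same input shows how
# much time goes to reading versus splitting and hash updates.
my $numlines = 0;
while (<>) {
    $numlines++;    # no chomp, no split, no hash work
}
print "numlines: $numlines\n";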

You may be better off sticking with the shell script, since it is more concise
and readable than the Perl code.

In general, I'm not a fan of reinventing the wheel, and I would suggest that
you reconsider your requirement of avoiding modern Perl features and CPAN
modules.
Perl versions 5.8 and earlier just don't have very much
library functionality in the core.
If your deployment requirements mandate support for 5.8 with no CPAN
modules, Perl is probably the wrong choice and you'd be better off using a
language that includes more libraries in the core.
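
For what it's worth, robust CSV parsing is one of the first things CPAN would
buy you here.  A rough sketch with Text::CSV (not in the 5.8 core, so exactly
the kind of module your requirement rules out; the explicit file handling is
my assumption, not something from your script):

use strict;
use Text::CSV;    # CPAN module, not shipped with Perl 5.8

my $file = shift @ARGV or die "usage: $0 file.csv\n";
open my $fh, '<', $file or die "cannot open $file: $!\n";

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

my @aoh;    # same array-of-hashes idea as in count_distinct.pl
while (my $row = $csv->getline($fh)) {
    for my $i (0 .. $#{$row}) {
        $aoh[$i]{ $row->[$i] }++;
    }
}
close $fh;

That handles quoted fields and embedded commas that a plain split(/,/)
would get wrong.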

In terms of style, I suggest you take a look at Perl Best Practices.  I would
change

$firstline = $_ if $numlines == 0;


to

if ($numlines == 0)
{
    $firstline = $_;
}
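
Along the same lines, PBP also suggests avoiding C-style for loops.  Your
inner loop could be written like this (same behavior, purely a style point):

for my $i (0 .. $#data) {
    $aoh[$i]{ $data[$i] }++;
}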

--

David

On Sat, Mar 8, 2014 at 10:59 AM, Steve Tolkin <[email protected]> wrote:

> I wrote a simple perl program to count the number of distinct values in
> each
> field of a file.  It just reads each line once, sequentially.  The input
> files vary in the number of rows: 1 million is typical, but some have as
> many as 100 million rows, and are 10 GB in size, so I am reluctant to use
> slurp to read it all at once.  It processes about 100,000 rows/second both
> on my PC and on the AIX server which is the real target machine.  To my
> surprise it is barely faster than a shell script that reads the entire file
> multiple times, once per field, even for a file with 5 fields.   The shell
> script calls a pipeline like this for each field: awk to extract 1 field
> value | sort -u | wc -l
>
> Here is nearly all of the perl program count_distinct.pl -- the important
> code is just after the word "autovivification":
>
> # For each field we increment the count for the data value in that field.
> # We currently do not actually use those counts, so we could have just said
> # anything on the RHS, e.g. =1.  But the counts might come in handy later,
> # e.g., if we want to see the most frequent value.
>
> use strict;
>
> # return code: 0 == success; 1 == some warnings; 2 == some errors
> my $rc = 0;
> my $split_char=',';  # CHANGE ME IF NEEDED (later use getopts)
>
> my @aoh; # array of hashes: key is the field value; value is the count
> my $numlines = 0;
> my $firstline;
> while (<>)
> {
>     chomp;
>     $firstline = $_ if $numlines == 0;
>     $numlines++;
>     my @data;
>
>     # Seemingly extra code below is for compatibility with older perl
>     # versions.  But this might not be needed anymore.
>     if ( $split_char ne '' )
>     { @data = split(/$split_char/,$_); }
>     else
>     { @data = split; }
>
>     # Populate array of hashes.  (This uses the perl "autovivification"
>     # feature.)
>     for (my $i = 0; $i < scalar @data; $i++) {
>         $aoh[$i]{$data[$i]}++;
>     }
> }
>
> # print output
> print "filename: $ARGV\n"; # This writes a "-" if data is piped in
> if ($numlines >0) {
>     print "firstline: $firstline\n";
> }
> print "numlines: $numlines\n";
> for (my $i = 0; $i < scalar @aoh; $i++) {
>     # The number of keys in a hash is the "count distinct" number
>     print "field#: ", $i, ", distinct: ", scalar keys %{$aoh[$i]}, "\n";
> }
> exit $rc;
>
> P.S. Suggestions to improve the style are also welcome.  I have not written
> perl for years.  I deliberately do not want to use features only in
> "modern"
> perl, or rely on any modules that must be installed.  Or, maybe someone has
> a different and faster program that does count distinct on the fields in a
> file.
>
> --
> thanks,
> Steve Tolkin

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
