On Sat, Mar 8, 2014 at 10:59 AM, Steve Tolkin <[email protected]> wrote:
> I wrote a simple perl program to count the number of distinct values in
> each field of a file. It just reads each line once, sequentially. The
> input files vary in the number of rows: 1 million is typical, but some
> have as many as 100 million rows, and are 10 GB in size, so I am
> reluctant to use slurp to read it all at once. It processes about
> 100,000 rows/second both on my PC and on the AIX server which is the
> real target machine. To my surprise it is barely faster than a shell
> script that reads the entire file multiple times, once per field, even
> for a file with 5 fields. The shell script calls a pipeline like this
> for each field: awk to extract 1 field value | sort -u | wc -l

I'm really curious about this benchmarking. I feel like most of the time on
this is I/O, so I wonder if the shell script speed isn't due to benchmarking
with a small file that remains cached in memory.

Also, I feel like you might be having an issue with the small read buffer
size in <> reads. The last time I had to write a tool like this to parse
10GB files, it ran a hell of a lot faster when I did my I/O with sysread in
1MB-or-larger chunks, and iterated over the buffer with m//g (rough sketch
below).

Fair warning: This is an awesome and useful method right up to the point
where the files have so many unique values that the hashes can no longer be
stored in physical memory. Once anything hash-based starts swapping,
performance goes out the window and never comes back. This should start to
happen somewhere past 100M unique values, so you're probably safe.

-C.
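Something along these lines (untested, and assuming a tab-delimited file;
the 1MB chunk size, the separator, and all the names are just placeholders,
not anything from Steve's actual program):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count distinct values per field, reading in big chunks instead of
    # line-at-a-time <>.  Tab is assumed as the field delimiter here.

    my $file = shift or die "usage: $0 file\n";
    open my $fh, '<:raw', $file or die "open $file: $!\n";

    my $chunk = 1 << 20;   # 1MB reads; bigger is fine too
    my $buf   = '';
    my @seen;              # one hash of distinct values per field

    while (1) {
        my $n = sysread($fh, $buf, $chunk, length $buf);
        die "sysread: $!\n" unless defined $n;
        last if $n == 0 && $buf eq '';

        # At EOF, make sure a final unterminated line still gets counted.
        $buf .= "\n" if $n == 0 && $buf !~ /\n\z/;

        # Hold back any trailing partial line for the next read.
        my $tail = $buf =~ s/([^\n]+)\z// ? $1 : '';

        # Walk the complete lines in the buffer with m//g.
        while ($buf =~ /\G([^\n]*)\n/g) {
            my @f = split /\t/, $1, -1;
            $seen[$_]{ $f[$_] } = 1 for 0 .. $#f;
        }

        $buf = $tail;
        last if $n == 0;
    }
    close $fh;

    printf "field %d: %d distinct values\n", $_ + 1,
        scalar keys %{ $seen[$_] } for 0 .. $#seen;

The point is to do a handful of big reads instead of millions of small
buffered ones, and let the regex engine walk the buffer.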

