On Sat, Mar 8, 2014 at 10:59 AM, Steve Tolkin <[email protected]> wrote:
> I wrote a simple perl program to count the number of distinct values in
> each field of a file. It just reads each line once, sequentially. The
> input files vary in the number of rows: 1 million is typical, but some
> have as many as 100 million rows, and are 10 GB in size, so I am
> reluctant to use slurp to read it all at once. It processes about
> 100,000 rows/second both on my PC and on the AIX server which is the
> real target machine. To my surprise it is barely faster than a shell
> script that reads the entire file multiple times, once per field, even
> for a file with 5 fields. The shell script calls a pipeline like this
> for each field: awk to extract 1 field value | sort -u | wc -l

I'm really curious about this benchmarking. I feel like most of the time on
this is I/O, so I wonder if the shell script speed isn't due to benchmarking
with a small file that remains cached in memory.

Also, I feel like you might be having an issue with the small read buffer
size in <> reads. The last time I had to write a tool like this to parse
10GB files, it ran a hell of a lot faster when I did my I/O with sysread in
1MB-or-larger chunks, and iterated over the buffer with m//g (rough sketch
below).

Fair warning: This is an awesome and useful method right up to the point
where the files have so many unique values that the hashes can no longer be
stored in physical memory. Once anything hash-based starts swapping,
performance goes out the window and never comes back. This should start to
happen somewhere past 100M unique values, so you're probably safe.

-C.
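Something along these lines (untested, and assuming a tab-delimited file;
the 1MB chunk size, the separator, and all the names are just placeholders,
not anything from Steve's actual program):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count distinct values per field, reading in big chunks instead of
    # line-at-a-time <>.  Tab is assumed as the field delimiter here.

    my $file = shift or die "usage: $0 file\n";
    open my $fh, '<:raw', $file or die "open $file: $!\n";

    my $chunk = 1 << 20;   # 1MB reads; bigger is fine too
    my $buf   = '';
    my @seen;              # one hash of distinct values per field

    while (1) {
        my $n = sysread($fh, $buf, $chunk, length $buf);
        die "sysread: $!\n" unless defined $n;
        last if $n == 0 && $buf eq '';

        # At EOF, make sure a final unterminated line still gets counted.
        $buf .= "\n" if $n == 0 && $buf !~ /\n\z/;

        # Hold back any trailing partial line for the next read.
        my $tail = $buf =~ s/([^\n]+)\z// ? $1 : '';

        # Walk the complete lines in the buffer with m//g.
        while ($buf =~ /\G([^\n]*)\n/g) {
            my @f = split /\t/, $1, -1;
            $seen[$_]{ $f[$_] } = 1 for 0 .. $#f;
        }

        $buf = $tail;
        last if $n == 0;
    }
    close $fh;

    printf "field %d: %d distinct values\n", $_ + 1,
        scalar keys %{ $seen[$_] } for 0 .. $#seen;

The point is to do a handful of big reads instead of millions of small
buffered ones, and let the regex engine walk the buffer.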

