On Thu, 18 Jan 2007, Jim Meyering wrote:

I've done some more timings, this time with two more input sizes.
Here's the summary, comparing plain sort with sort --compress=gzip:

 2.7GB:   6.6% speed-up
 10.0GB: 17.8% speed-up

It would be interesting to see the individual stats returned by wait4(2) for each child, to separate the CPU seconds spent in sort itself from those spent in the compression/decompression forks.
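
For illustration, here's a minimal standalone sketch of the kind of accounting I mean (not the actual coreutils code; gzip -1 on /etc/services is just a stand-in for one compression fork): wait4(2) hands back an rusage for the child, which can be reported separately from the parent's own getrusage(RUSAGE_SELF) figures.

  /* Minimal sketch (not coreutils code): fork one compressor child,
     discard its output, and report its CPU time (as returned by
     wait4) separately from the parent's own usage.  */
  #include <stdio.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sys/wait.h>

  int
  main (void)
  {
    pid_t pid = fork ();
    if (pid < 0)
      {
        perror ("fork");
        return 1;
      }
    if (pid == 0)
      {
        /* Child: stand-in for one compression fork.  */
        int devnull = open ("/dev/null", O_WRONLY);
        if (devnull >= 0)
          dup2 (devnull, STDOUT_FILENO);
        execlp ("gzip", "gzip", "-1", "-c", "/etc/services", (char *) NULL);
        _exit (127);
      }

    int status;
    struct rusage child_ru;
    if (wait4 (pid, &status, 0, &child_ru) < 0)
      {
        perror ("wait4");
        return 1;
      }

    struct rusage self_ru;
    getrusage (RUSAGE_SELF, &self_ru);

    printf ("child  user CPU: %ld.%06ld s\n",
            (long) child_ru.ru_utime.tv_sec,
            (long) child_ru.ru_utime.tv_usec);
    printf ("parent user CPU: %ld.%06ld s\n",
            (long) self_ru.ru_utime.tv_sec,
            (long) self_ru.ru_utime.tv_usec);
    return 0;
  }

In sort itself the per-child figures would simply be accumulated across all the compressor children and reported alongside sort's own usage.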

I think allowing an environment variable to define the compressor is a good idea, so long as there's a corresponding --nocompress override available from the command line.
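
Something along these lines is the precedence I'd expect (just a sketch; the GNUSORT_COMPRESSOR variable name and the --nocompress flag are placeholders, not options sort actually has):

  /* Sketch of the precedence I have in mind; GNUSORT_COMPRESSOR and
     --nocompress are placeholder names, not existing sort options.  */
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  static char const *
  choose_compressor (char const *compress_opt, bool no_compress)
  {
    if (no_compress)           /* --nocompress always wins */
      return NULL;
    if (compress_opt)          /* --compress=PROG beats the environment */
      return compress_opt;
    char const *env = getenv ("GNUSORT_COMPRESSOR");
    if (env && *env)           /* otherwise honour the environment */
      return env;
    return NULL;               /* default: leave temporaries uncompressed */
  }

  int
  main (void)
  {
    char const *prog = choose_compressor (NULL, false);
    printf ("compressor: %s\n", prog ? prog : "(none)");
    return 0;
  }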

 $ seq 9999999 > k
 $ cat k k k k k k k k k > j
 $ cat j j j j > sort-in
 $ wc -c sort-in
 2839999968 sort-in

I had to use "seq -f %.0f" to get this file size (plain seq prints values this large in exponential notation).

With --compress=gzip:
 $ /usr/bin/time ./sort -T. --compress=gzip < sort-in > out
 814.07user 29.97system 14:50.16elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
 0inputs+0outputs (4major+2821589minor)pagefaults 0swaps

There's a big difference in the time spent on gzip compression depending on the -1/-9 option (the default is -6). For a seq-generated data set similar to the one above, I get:

 gzip -1: User time (seconds): 48.63; output size is 6% of input
 gzip -9: User time (seconds): 952.97; output size is 3% of input

Decompression time for both tests shows less variation (25s vs 21s).

This suggests that the elapsed time of the sort can be improved by trading compression ratio for less CPU time. Obviously the disk latency is a critical factor.
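
As a crude back-of-envelope check (the 50 MB/s disk figure below is invented; the CPU seconds are the gzip -1/-9 user times above, and read-back, decompression and overlap with sorting are all ignored), compression only helps elapsed time when the CPU seconds it costs are smaller than the write seconds it saves:

  /* Back-of-envelope comparison; the disk rate is an assumed figure.  */
  #include <stdio.h>

  int
  main (void)
  {
    double temp_bytes = 2.8e9;    /* roughly the temporary data written */
    double disk_rate  = 50e6;     /* assumed disk throughput, bytes/s */

    /* gzip -1: output 6% of input, 48.63 user seconds (measured above).  */
    double saved_1 = temp_bytes * (1 - 0.06) / disk_rate;
    /* gzip -9: output 3% of input, 952.97 user seconds (measured above).  */
    double saved_9 = temp_bytes * (1 - 0.03) / disk_rate;

    printf ("gzip -1: ~%.0f s of writes saved for 48.63 s of CPU\n", saved_1);
    printf ("gzip -9: ~%.0f s of writes saved for 952.97 s of CPU\n", saved_9);
    return 0;
  }

With these (made-up) disk numbers, -9's extra ~900 CPU seconds buy only a couple more seconds of saved writes, so a fast, light compression level looks like the sensible default.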


Cheers,
Phil

