On Thu, 18 Jan 2007, Jim Meyering wrote:
> I've done some more timings, but with two more sizes of input.
> Here's the summary, comparing straight sort with sort --comp=gzip:
>   2.7GB:  6.6% speed-up
>  10.0GB: 17.8% speed-up

It would be interesting to see the per-child statistics returned by
wait4(2), to separate the CPU seconds spent in sort itself from those
spent in the compression/decompression forks.
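
Something along these lines would show the split (just a sketch, not
anything from the sort sources; gzip -1 over /etc/services is only a
stand-in workload):

  /* Minimal sketch, not taken from sort: run one gzip child over a
     stand-in input and report its CPU time separately from the
     parent's own.  */
  #include <stdio.h>
  #include <sys/resource.h>
  #include <sys/time.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static double
  seconds (struct timeval tv)
  {
    return tv.tv_sec + tv.tv_usec / 1e6;
  }

  int
  main (void)
  {
    int status;
    struct rusage child_ru, self_ru;
    pid_t pid = fork ();

    if (pid < 0)
      return 1;
    if (pid == 0)
      {
        /* Child: discard the compressed output; only the CPU time matters.  */
        if (!freopen ("/dev/null", "w", stdout))
          _exit (127);
        execlp ("gzip", "gzip", "-1", "-c", "/etc/services", (char *) NULL);
        _exit (127);
      }

    wait4 (pid, &status, 0, &child_ru);   /* rusage of the reaped child */
    getrusage (RUSAGE_SELF, &self_ru);    /* the parent's own CPU time */

    printf ("child: %.2fu %.2fs   parent: %.2fu %.2fs\n",
            seconds (child_ru.ru_utime), seconds (child_ru.ru_stime),
            seconds (self_ru.ru_utime), seconds (self_ru.ru_stime));
    return 0;
  }

In sort itself, the same struct could be accumulated across all the
compressor children to give the totals I mean.
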
I think allowing an environment variable to define the compressor is a
good idea, so long as there's a corresponding --nocompress override
available from the command line.
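
Purely as a sketch of the precedence I have in mind (SORT_COMPRESS is
a made-up name, and none of this exists in sort today):

  /* Hypothetical sketch only: SORT_COMPRESS and the option names in
     the comments are placeholders, not existing features of sort.  */
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Return the compressor to run, or NULL for no compression.  */
  static char const *
  choose_compressor (char const *compress_opt, bool nocompress)
  {
    if (nocompress)
      return NULL;                      /* --nocompress always wins */
    if (compress_opt)
      return compress_opt;              /* then an explicit --compress=PROG */
    return getenv ("SORT_COMPRESS");    /* else the environment's default */
  }

  int
  main (void)
  {
    char const *prog = choose_compressor (NULL, false);
    printf ("compressor: %s\n", prog ? prog : "(none)");
    return 0;
  }
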
> $ seq 9999999 > k
> $ cat k k k k k k k k k > j
> $ cat j j j j > sort-in
> $ wc -c sort-in
> 2839999968 sort-in
> I had to use "seq -f %.0f" to get this filesize.
> With --compress=gzip:
> $ /usr/bin/time ./sort -T. --compress=gzip < sort-in > out
> 814.07user 29.97system 14:50.16elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (4major+2821589minor)pagefaults 0swaps

There's a big difference in the time spent on gzip compression
depending on the compression level (-1 through -9; the default is -6).
For a seq-generated data set similar to the one above, I get:

  gzip -1: User time (seconds): 48.63, output size 6% of the input
  gzip -9: User time (seconds): 952.97, output size 3% of the input

Decompression time for the two tests shows much less variation
(25s vs 21s).  This suggests the elapsed time to sort can be improved
by giving up some compression ratio in exchange for less CPU time.
Obviously, a critical factor is the disk latency.
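
To put rough numbers on that trade-off, assuming a sustained disk
rate somewhere around 50 MB/s (only a guess, for illustration): going
from -1 to -9 costs roughly 950 - 49 = 900 extra CPU seconds, but
shrinks the temporary data by only about 3% of the 2.8 GB input,
i.e. some 85 MB, written once and read back once, so only a few
seconds of I/O at that rate.  Unless the disks are far slower than
that, -1 looks like the better end of the trade here.
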
Cheers,
Phil
_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils