I recently heard a complaint about gzip from a guy on another mailing list. He was trying to back up tens of GB of data every day, and tar-gzipping (tar czvf) was unacceptably slow.
I once faced the same problem when I needed to create hard drive snapshots for computers. Obviously I wanted to save bandwidth so that I wouldn't have to transfer a lot of data over a 100Mbps line.

Let's suppose we can save 5GB on a 15GB file by compressing it. To transfer 15GB we need 15,000 MB / (100/8) MB/sec = 1,200 secs = 20 mins on a perfect network. Usually on the Truman network (cross-buildings) it takes 3 times as long, so realistically we need 60 minutes to transfer a 15GB snapshot image. By compressing, the resulting 10GB file would take only 40 mins to transfer.

Good deal? No. It *didn't help*. It takes more than 1 hour to compress that file, so the uploading process takes even longer. The clients (Pentium 4 2.8 HT) somehow struggle to decompress the file too, so the result comes out even. So why the hassle?

My conclusion: it's better *not* to compress the image with gzip at all. It's even clearer when you have a fast connection: the time gained on IO goes to CPU computation, and the result comes out worse.

Turns out gzip, bzip2 and zip are all terrible in CPU usage: they take a lot of time to compress and decompress. There are other algorithms that compress a little bit worse than gzip but are much easier on the CPU (most of them are based on the Lempel-Ziv algorithm): LZO, Google's Snappy, LZF, and LZ4. LZ4 is crazily fast.

I did some quick benchmarking with the Linux source:

    1634!ht:~/src/lz4-read-only$ time ./tar-none.sh ../linux-3.0-rc6 linux-s
    real 0m4.390s
    user 0m0.620s
    sys 0m0.870s

    1635!ht:~/src/lz4-read-only$ time ./tar-gzip.sh ../linux-3.0-rc6 linux-s
    real 0m43.683s
    user 0m40.901s
    sys 0m0.319s

    1636!ht:~/src/lz4-read-only$ time ./tar-lz4.sh ../linux-3.0-rc6 linux-s
    real 0m5.568s
    user 0m4.831s
    sys 0m0.272s

Clear win for lz4! (I used a pipe, so theoretically it can be even better.)
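For anyone who wants to plug in their own numbers, the snapshot arithmetic above can be sketched in a few lines of Python. This is only a rough model, not something I measured with: the function names are made up for illustration, the 60-minute compression time is a stand-in for "more than 1 hour", and it assumes compression finishes before the upload starts (no pipelining) and ignores decompression on the client.

```python
def transfer_minutes(size_mb, link_mbps=100, slowdown=3.0):
    """Minutes needed to send size_mb over the link.

    slowdown is the real-world penalty over a perfect network
    (3x in the cross-building example).
    """
    mb_per_sec = link_mbps / 8  # 100 Mbps -> 12.5 MB/sec
    return size_mb / mb_per_sec * slowdown / 60


def total_minutes(size_mb, compress_min=0.0):
    """Compression time plus transfer time, done one after the other."""
    return compress_min + transfer_minutes(size_mb)


# Numbers from the example: 15GB raw image, 10GB after compression.
raw = total_minutes(15_000)
gz = total_minutes(10_000, compress_min=60)  # assumed: exactly 1 hour to compress

# Compression only pays off if it costs less than the transfer time it saves.
break_even = transfer_minutes(15_000) - transfer_minutes(10_000)

print(f"uncompressed: {raw:.0f} min")
print(f"compressed:   {gz:.0f} min")
print(f"compression must take under {break_even:.0f} min to be worth it")
```

With these inputs the uncompressed transfer is 60 minutes, the compress-then-send route is 100 minutes, and the compressor would have to finish in under 20 minutes just to break even.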
I have patched the lz4 utility so that it happily accepts "std" as the infile to mean stdin, and "std" as the outfile to mean stdout, so you can pipe from whatever program you like. For the utility:

    git clone g...@github.com:htruong/lz4.git

Cheers, nice weekend,
- Huan.

-- 
Huan Truong
600-988-9066
http://tnhh.net/