I recently heard a complaint about gzip from someone on another mailing
list. He was trying to back up tens of GB of data every day, and
tar-gzipping (tar czvf) was unacceptably slow.

I once faced the same problem when I needed to create hard drive
snapshots for computers. Obviously I wanted to save bandwidth so that
I wouldn't have to transfer a lot of data over a 100Mbps line.

Let's suppose we can save 5GB on a 15GB file by compressing it.
To transfer 15GB we need 15,000 MB / (100/8) MB/sec = 1,200 secs = 20
mins on a perfect network. On the Truman network (across buildings)
it usually takes 3 times as long, so realistically we need 60 minutes to
transfer a 15GB snapshot image. After compression, the resulting 10GB
file would take only 40 mins to transfer. Good deal? No.
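
For anyone who wants to plug in their own numbers, the estimate above can be
sketched in a few lines of Python. The function name and the "slowdown"
parameter are my own illustration; the 3x factor is just the rule of thumb
quoted above, not a measurement.

```python
def transfer_minutes(size_gb, link_mbps=100, slowdown=1.0):
    """Minutes needed to send size_gb (decimal GB) over a link_mbps line."""
    size_mb = size_gb * 1000       # 1 GB = 1,000 MB, as in the estimate above
    mb_per_sec = link_mbps / 8     # 100 Mbps -> 12.5 MB/sec
    return size_mb / mb_per_sec * slowdown / 60


ideal = transfer_minutes(15)                     # perfect network
realistic = transfer_minutes(15, slowdown=3)     # uncompressed, 3x slower
compressed = transfer_minutes(10, slowdown=3)    # compressed, 3x slower
print(ideal, realistic, compressed)
```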

It *didn't help*. It takes more than 1 hour to compress that file, so
the whole upload takes even longer. The clients (Pentium 4 2.8 HT)
struggle to decompress the file too, so the result comes out
even. So why the hassle? My conclusion: it's better *not* to compress
the image with gzip at all. This is even clearer when you have a
fast connection: the I/O time you save is spent on CPU computation
instead, and the result comes out worse.
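
A rough way to see why compression loses here is to add up the stages. This
is only a sketch under assumptions: the function is hypothetical, 60 minutes
is a lower bound for "more than 1 hour" of gzip time, and the decompression
time is left at zero because I never measured it separately.

```python
def total_minutes(transfer_min, compress_min=0.0, decompress_min=0.0,
                  pipelined=False):
    """End-to-end time for shipping one image.

    Sequential: each stage waits for the previous one, so the times add up.
    Pipelined: the stages overlap, so the slowest stage sets the pace.
    """
    stages = [compress_min, transfer_min, decompress_min]
    return max(stages) if pipelined else sum(stages)


plain = total_minutes(transfer_min=60)
# Compress first, then upload the 10GB result.
gzipped = total_minutes(transfer_min=40, compress_min=60)
# Compress while uploading: the CPU, not the network, is the bottleneck.
gzipped_piped = total_minutes(transfer_min=40, compress_min=60, pipelined=True)
print(plain, gzipped, gzipped_piped)
```

Even in the pipelined case the compressed transfer only ties with the plain
one, and any extra compression or decompression time makes it worse.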

It turns out that gzip, as well as bzip2 and zip, is terrible in CPU
usage: it takes a lot of time to compress and decompress. There are other
algorithms that compress a little worse than gzip but are much easier
on the CPU (most of them are based on the Lempel-Ziv algorithm): LZO,
Google's Snappy, LZF, and LZ4. LZ4 is crazily fast.
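
If you want a feel for the speed-versus-ratio trade-off without installing
anything, here is a small sketch using only Python's standard library. Note
the substitution: LZO, Snappy, LZF and LZ4 are not in the standard library,
so this compares zlib (the algorithm behind gzip) at two effort levels with
bz2 instead. The sample data is synthetic and very repetitive, so treat the
numbers as illustrative only.

```python
import bz2
import time
import zlib


def benchmark(name, compress, data):
    """Return (name, compressed/original size ratio, seconds taken)."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(packed) / len(data), elapsed


# Synthetic sample; real results depend heavily on the input.
data = b"".join(
    b"line %d: the quick brown fox jumps over the lazy dog\n" % i
    for i in range(50000)
)

results = [
    benchmark("zlib level 1", lambda d: zlib.compress(d, 1), data),
    benchmark("zlib level 6", lambda d: zlib.compress(d, 6), data),
    benchmark("bz2 level 9", lambda d: bz2.compress(d, 9), data),
]
for name, ratio, seconds in results:
    print(f"{name:14s} ratio={ratio:.3f} time={seconds:.3f}s")
```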

I did some quick benchmarking with the Linux source:

1634!ht:~/src/lz4-read-only$ time ./tar-none.sh ../linux-3.0-rc6 linux-s
real    0m4.390s
user    0m0.620s
sys     0m0.870s

1635!ht:~/src/lz4-read-only$ time ./tar-gzip.sh ../linux-3.0-rc6 linux-s
real    0m43.683s
user    0m40.901s
sys     0m0.319s

1636!ht:~/src/lz4-read-only$ time ./tar-lz4.sh ../linux-3.0-rc6 linux-s
real    0m5.568s
user    0m4.831s
sys     0m0.272s

Clear win for lz4! (I used a pipe, so theoretically it can be even
better.)

I have patched the lz4 utility so that it happily accepts stdin as the
infile and stdout as the outfile, so you can pipe from whatever program
you like.
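
The same idea, archiving straight into a compressor with no temporary file
in between, can be sketched in Python. This is not the lz4 patch itself:
zlib stands in for lz4 (which is not in the standard library), and the
CompressingWriter class and tar_to_stream function are hypothetical names
made up for the illustration.

```python
import io
import tarfile
import zlib


class CompressingWriter:
    """Write-only file object that compresses data on its way to sink."""

    def __init__(self, sink, level=1):
        self._sink = sink
        self._compressor = zlib.compressobj(level)

    def write(self, chunk):
        self._sink.write(self._compressor.compress(chunk))
        return len(chunk)

    def finish(self):
        # Flush whatever the compressor is still buffering.
        self._sink.write(self._compressor.flush())


def tar_to_stream(members, sink):
    """Tar (name, bytes) pairs straight into a compressed stream."""
    writer = CompressingWriter(sink)
    # "w|" is tarfile's non-seekable stream mode, the in-process
    # analogue of piping tar into a compressor.
    with tarfile.open(fileobj=writer, mode="w|") as archive:
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    writer.finish()


sink = io.BytesIO()  # stands in for stdout or a network socket
tar_to_stream([("hello.txt", b"hello, world\n" * 1000)], sink)
print(len(sink.getvalue()), "compressed bytes")
```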

git clone g...@github.com:htruong/lz4.git for the utility.


Cheers, nice weekend,
- Huan.
-- 
Huan Truong
600-988-9066
http://tnhh.net/
