Oh, and you shouldn't use 'std' for stdin and stdout. You should use '-'. That's what many programs do (gzip, for example); the hyphen will be more familiar to experienced users, since it's already an established interface convention.
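That way people could write things like this (a sketch, assuming your patched tool keeps the same infile/outfile argument positions, just spelled '-' instead of 'std'; the directory and host names are made up):

  tar cf - bigdir | lz4 - bigdir.tar.lz4
  tar cf - bigdir | lz4 - - | ssh backuphost 'cat > bigdir.tar.lz4'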
Don

On Fri, Jul 8, 2011 at 4:31 PM, Don Bindner <don.bind...@gmail.com> wrote:
> Did you remember to run your tests repeatedly in different orders to
> minimize the effects that caching might have on your results?
>
> Don
>
> On Fri, Jul 8, 2011 at 4:19 PM, Huan Truong <hnt7...@truman.edu> wrote:
>
>> I've heard a complaint from a guy on another mailing list about gzip
>> recently. He was trying to back up tens of GB of data every day, and
>> tar-gzipping it (tar czvf) was unacceptably slow.
>>
>> I once faced the same problem when I needed to create hard drive
>> snapshots for computers, and obviously I wanted to save bandwidth so
>> that I wouldn't have to transfer a lot of data over a 100 Mbps line.
>>
>> Let's suppose we can save 5 GB on a 15 GB file by compressing it. To
>> transfer 15 GB we need 15,000 MB / (100/8) MB/sec = 1,200 secs = 20
>> mins on a perfect network. On the Truman network (cross-building) it
>> usually takes 3 times as long, so realistically we need 60 minutes to
>> transfer a 15 GB snapshot image. The compressed 10 GB file would take
>> only 40 mins to transfer. Good deal? No.
>>
>> It *didn't help*. It takes more than an hour to compress that file, so
>> the upload process takes even longer overall. The clients (Pentium 4
>> 2.8 GHz w/ HT) struggle to decompress the file too, so the result
>> comes out even. So why the hassle? My conclusion: it's better *not* to
>> compress the image with gzip at all. It's even clearer when you have a
>> fast connection: whatever you gain in I/O you pay back in CPU time,
>> and the result comes out worse.
>>
>> It turns out gzip, bzip2, and zip are all terrible in CPU usage; they
>> take a lot of time to compress and decompress. There are other
>> algorithms that compress a little worse than gzip but are much easier
>> on the CPU (most of them based on Lempel-Ziv): LZO, Google's Snappy,
>> LZF, and LZ4. LZ4 is crazily fast.
>>
>> I did some quick benchmarking with the Linux kernel source:
>>
>> 1634!ht:~/src/lz4-read-only$ time ./tar-none.sh ../linux-3.0-rc6 linux-s
>> real 0m4.390s
>> user 0m0.620s
>> sys 0m0.870s
>>
>> 1635!ht:~/src/lz4-read-only$ time ./tar-gzip.sh ../linux-3.0-rc6 linux-s
>> real 0m43.683s
>> user 0m40.901s
>> sys 0m0.319s
>>
>> 1636!ht:~/src/lz4-read-only$ time ./tar-lz4.sh ../linux-3.0-rc6 linux-s
>> real 0m5.568s
>> user 0m4.831s
>> sys 0m0.272s
>>
>> A clear win for LZ4! (I used a pipe, so theoretically it can be even
>> better.)
>>
>> I have patched the lz4 utility so that it will happily accept 'std' as
>> the infile (meaning stdin) and 'std' as the outfile (meaning stdout),
>> so you can pipe from whatever program you like.
>>
>> git clone g...@github.com:htruong/lz4.git for the utility.
>>
>> Cheers, nice weekend,
>> - Huan.
>> --
>> Huan Truong
>> 600-988-9066
>> http://tnhh.net/
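P.S. For anyone who wants to reproduce the numbers: the test scripts
themselves aren't shown above, but from the description (tar piped
straight into each compressor, with the patched lz4 taking 'std' for
stdin) they are presumably along these lines. This is a guess at their
contents, not the actual scripts:

  #!/bin/sh
  # tar-none.sh: tar only, no compression (the baseline)
  tar cf "$2.tar" "$1"

  #!/bin/sh
  # tar-gzip.sh: tar piped through gzip
  tar cf - "$1" | gzip > "$2.tar.gz"

  #!/bin/sh
  # tar-lz4.sh: tar piped into the patched lz4 ('std' = read from stdin)
  tar cf - "$1" | lz4 std "$2.tar.lz4"

Each would be invoked as in the timings above, e.g.
./tar-lz4.sh ../linux-3.0-rc6 linux-s.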