Oh, and you shouldn't use 'std' for stdin and stdout.  You should use
'-'.  That's what many programs do (gzip, for example); the hyphen will
be more familiar to experienced users since it's already an established
interface convention.
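
For instance, gzip already treats '-' as stdin/stdout, so it drops
straight into pipelines (just a sketch of the convention, nothing
lz4-specific):

  tar cf - somedir | gzip - > somedir.tar.gz
  gzip -dc somedir.tar.gz | tar xf -

Supporting the same spelling in lz4 would let people swap it into
existing gzip pipelines without relearning anything.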

Don

On Fri, Jul 8, 2011 at 4:31 PM, Don Bindner <don.bind...@gmail.com> wrote:

> Did you remember to run your tests repeatedly in different orders to
> minimize the effects that caching might have on your results?
>
> Don
>
>
> On Fri, Jul 8, 2011 at 4:19 PM, Huan Truong <hnt7...@truman.edu> wrote:
>
>> I've heard a complaint from a guy on another mailing list about gzip
>> recently. He was trying to back up tens of GB of data every day, and
>> tar-gzipping it (tar czvf) was unacceptably slow.
>>
>> I once faced the same problem when I needed to create hard drive
>> snapshots for computers, and obviously I wanted to save bandwidth so
>> that I wouldn't have to transfer a lot of data over a 100Mbps line.
>>
>> Let's suppose we can save 5GB on a 15GB file by compressing it. To
>> transfer 15GB we need 15,000 MB / (100/8) MB/sec = 1,200 sec = 20
>> mins on a perfect network. Usually on the Truman network
>> (cross-building) it takes 3 times as long, so realistically we need
>> 60 minutes to transfer a 15GB snapshot image. The compressed 10GB
>> file would take only 40 mins to transfer. Good deal? No.
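>>
>> (You can sanity-check the arithmetic with bc, using the same numbers
>> as above:
>>
>>   $ echo "15000 / (100/8) / 60" | bc -l      # 15GB at full line rate
>>   20.00000000000000000000
>>   $ echo "3 * 10000 / (100/8) / 60" | bc -l  # 10GB at 1/3 line rate
>>   40.00000000000000000000
>>
>> so the 20- and 40-minute figures check out.)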
>>
>> It *didn't help*. It took more than 1 hour to compress that file, so
>> the upload process ended up taking even longer. The clients (Pentium
>> 4 2.8 HT) struggled to decompress the file too, so the result came
>> out about even. So why the hassle? My conclusion: it's better *not*
>> to compress the image with gzip at all. This is even clearer when you
>> have a fast connection: whatever you gain in I/O you lose to CPU
>> time, and the result comes out worse.
>>
>> It turns out gzip, and also bzip2 and zip, are terrible in CPU usage;
>> they take a lot of time to compress and decompress. There are other
>> algorithms that compress a little worse than gzip but are much easier
>> on the CPU (most of them based on the Lempel-Ziv algorithm): LZO,
>> Google's Snappy, LZF, and LZ4. LZ4 is crazily fast.
>>
>> I did some quick benchmarking with the Linux source:
>>
>> 1634!ht:~/src/lz4-read-only$ time ./tar-none.sh ../linux-3.0-rc6 linux-s
>> real    0m4.390s
>> user    0m0.620s
>> sys     0m0.870s
>>
>> 1635!ht:~/src/lz4-read-only$ time ./tar-gzip.sh ../linux-3.0-rc6 linux-s
>> real    0m43.683s
>> user    0m40.901s
>> sys     0m0.319s
>>
>> 1636!ht:~/src/lz4-read-only$ time ./tar-lz4.sh ../linux-3.0-rc6 linux-s
>> real    0m5.568s
>> user    0m4.831s
>> sys     0m0.272s
>>
>> Clear win for lz4! (I used a pipe, so theoretically it can be even
>> better; the scripts are roughly sketched below.)
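>>
>> (The tar-*.sh scripts aren't shown above; they're roughly one-liners
>> like these, where the exact lz4 invocation depends on the patched
>> utility's syntax:
>>
>>   # tar-none.sh <srcdir> <name>: plain tar, no compression
>>   tar cf "$2.tar" "$1"
>>
>>   # tar-gzip.sh <srcdir> <name>: tar piped through gzip
>>   tar cf - "$1" | gzip > "$2.tar.gz"
>>
>>   # tar-lz4.sh <srcdir> <name>: tar piped through the patched lz4,
>>   # with 'std' telling lz4 to read the archive from stdin
>>   tar cf - "$1" | lz4 std "$2.tar.lz4"
>>
>> Piping keeps tar and the compressor running concurrently, which is
>> why the lz4 run adds barely a second over plain tar.)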
>>
>> I have patched the lz4 utility so that it happily accepts 'std' as
>> the infile (meaning stdin) and as the outfile (meaning stdout), so
>> you can pipe to and from whatever program you like.
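>>
>> So, assuming the 'std' spelling described above, creating a snapshot
>> over a pipe looks something like this (decompression is the same idea
>> in reverse, with whatever decompress flag the utility uses):
>>
>>   tar cf - /some/dir | ./lz4 std snapshot.tar.lz4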
>>
>> git clone git@github.com:htruong/lz4.git for the utility.
>>
>>
>> Cheers, nice weekend,
>> - Huan.
>> --
>> Huan Truong
>> 600-988-9066
>> http://tnhh.net/
>>
>>
>
