I would not expect such a huge difference in compression ratio. On
our side, the compression ratios of gzip and GzipCodec (BLOCK) are
within 10% of each other.
Log files usually compress 5x to 15x, so 250 MB (about 14x for a
3500 MB input) looks right.

The 1600MB number looks like record-level compression. Are you sure
you've turned on block compression?
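
To show why record-level compression can give a number like that, here is a
rough sketch in Python. It is only a toy: it uses zlib as a stand-in for the
Hadoop codecs and does not reproduce the SequenceFile format. The log line
layout, the helper names, and the block size of 1000 records are all made up
for illustration.

```python
import random
import zlib


def make_log_lines(n, seed=0):
    """Generate n synthetic access-log lines (hypothetical format)."""
    rng = random.Random(seed)
    paths = ["/index.html", "/search", "/cart", "/login"]
    agent = "Mozilla/5.0 (X11; U; Linux i686; en-US) Gecko/2009 Firefox/3.5"
    lines = []
    for _ in range(n):
        line = "2009-07-27 08:%02d:%02d GET %s 200 %d \"%s\"\n" % (
            rng.randrange(60),
            rng.randrange(60),
            rng.choice(paths),
            rng.randrange(100, 9999),
            agent,
        )
        lines.append(line.encode("ascii"))
    return lines


def record_level_size(records):
    """Compress every record on its own, so no redundancy is shared."""
    return sum(len(zlib.compress(r)) for r in records)


def block_level_size(records, records_per_block=1000):
    """Compress records in batches, so repeated text across records is shared."""
    total = 0
    for start in range(0, len(records), records_per_block):
        block = b"".join(records[start:start + records_per_block])
        total += len(zlib.compress(block))
    return total


records = make_log_lines(5000)
raw = sum(len(r) for r in records)
rec = record_level_size(records)
blk = block_level_size(records)

print("raw bytes:          ", raw)
print("record-level bytes: ", rec, "ratio %.1fx" % (raw / rec))
print("block-level bytes:  ", blk, "ratio %.1fx" % (raw / blk))
```

Compressing each line separately cannot exploit the repetition between lines,
which is where most of the savings in log files come from, so the record-level
total stays much closer to the raw size than the block-level total does.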

Zheng

On Mon, Jul 27, 2009 at 8:38 AM, Saurabh Nanda<[email protected]> wrote:
>
>> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8
>> compressed files)
>> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB
>> over 126 compressed files)
>
> Why is there such a *big* difference in compression ratios between the gzip
> utility and Hive?
>
> Uncompressed file size: approx 3500 MB
> Gzip utility: approx 250 MB
> org.apache.hadoop.io.compress.GzipCodec (BLOCK): approx 1600 MB
> org.apache.hadoop.io.compress.DefaultCodec (BLOCK): approx 1700 MB
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng
