I can't imagine there being such a huge difference in compression ratio. On our side, the compression ratios of gzip and GzipCodec (BLOCK) are within 10% of each other. Log files usually compress 5x to 15x, so 250MB (about 14x on a 3500MB input) looks like a good result.
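
A rough way to see why block compression should track plain gzip this closely, while per-record compression cannot: the Python sketch below compresses the same records three ways. It uses synthetic log lines and zlib rather than Hadoop's SequenceFile writer, and the `BLOCK_RECORDS` value and the `ratio` helper are made up for illustration, so treat the output as indicative only.

```python
import zlib


def ratio(uncompressed: int, compressed: int) -> float:
    """Compression ratio: uncompressed size divided by compressed size."""
    return uncompressed / compressed


# Synthetic log lines (an assumption for illustration, not the real logs).
records = [
    (
        f"2009-07-27 08:{i % 60:02d}:{(i * 7) % 60:02d} INFO "
        f"request id={i} path=/item/{i % 500} status=200\n"
    ).encode()
    for i in range(20000)
]
raw_size = sum(len(r) for r in records)

# Record-level: each record is compressed on its own, so the compressor
# never sees the redundancy shared between lines.
record_level = sum(len(zlib.compress(r)) for r in records)

# Block-level: records are buffered and compressed together in chunks.
BLOCK_RECORDS = 5000  # hypothetical block size, counted in records
block_level = sum(
    len(zlib.compress(b"".join(records[i:i + BLOCK_RECORDS])))
    for i in range(0, len(records), BLOCK_RECORDS)
)

# Whole-file: roughly what running the gzip utility on the file does.
whole_file = len(zlib.compress(b"".join(records)))

print(f"record-level: {ratio(raw_size, record_level):.1f}x")
print(f"block-level:  {ratio(raw_size, block_level):.1f}x")
print(f"whole-file:   {ratio(raw_size, whole_file):.1f}x")

# Ratios implied by the sizes quoted in this thread (MB in, MB out).
print(f"gzip utility:      {ratio(3500, 250):.1f}x")
print(f"GzipCodec (BLOCK): {ratio(3500, 1600):.1f}x")
```

With data like this, the block-level and whole-file ratios should come out close to each other, and the record-level ratio far lower, which is the pattern the 1600MB figure resembles.
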
The 1600MB number looks like record-level compression. Are you sure you've turned on block compression?

Zheng

On Mon, Jul 27, 2009 at 8:38 AM, Saurabh Nanda<[email protected]> wrote:
>
>> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8
>> compressed files)
>> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB
>> over 126 compressed files)
>
> Why is there such a *big* difference in compression ratios between the gzip
> utility and Hive?
>
> Uncompressed file size: approx 3500 MB
> Gzip utility: approx 250 MB
> org.apache.hadoop.io.compress.GzipCodec (BLOCK): approx 1600 MB
> org.apache.hadoop.io.compress.DefaultCodec (BLOCK): approx 1700 MB
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>

--
Yours,
Zheng
