Hi Zheng, I cross-checked. I am setting the following in my Hive script before the INSERT command:
SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;

A 132 MB (gzipped) input file, after going through a cleanup and being
inserted into a SequenceFile table, grows to 432 MB. What could be going
wrong? (A fuller sketch of the script is included after the quoted thread
below.)

Saurabh.

On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <[email protected]> wrote:

> Thanks, Zheng. Will do some more tests and get back.
>
> Saurabh.
>
> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <[email protected]> wrote:
>
>> I would first check whether it is really block compression or record
>> compression.
>> Also, maybe the block size is too small, but I am not sure whether that
>> is tunable in SequenceFile or not.
>>
>> Zheng
>>
>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <[email protected]> wrote:
>> > Hi,
>> >
>> > The size of my gzipped weblog files is about 35 MB. However, upon
>> > enabling block compression and inserting the logs into another Hive
>> > table (SequenceFile), the file size bloats up to about 233 MB. I've
>> > done similar processing on a local Hadoop/Hive cluster, and while the
>> > compression is not as good as gzipping, it still is not this bad.
>> > What could be going wrong?
>> >
>> > I looked at the header of the resulting file and here's what it says:
>> >
>> > SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>> >
>> > Does Amazon Elastic MapReduce behave differently or am I doing
>> > something wrong?
>> >
>> > Saurabh.
>> > --
>> > http://nandz.blogspot.com
>> > http://foodieforlife.blogspot.com
>>
>> --
>> Yours,
>> Zheng
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
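For reference, here is a minimal sketch of the kind of script being
described, assuming a plain-text staging table and a SequenceFile target.
The table names (weblogs_raw, weblogs_seq) and the job-level mapred.output.*
properties are illustrative additions, not settings confirmed anywhere in
this thread; the thought is only that the MapReduce job properties, rather
than io.seqfile.compression.type on its own, may be what actually decide the
compression type of the SequenceFiles that get written.

    -- Minimal sketch; table names and the mapred.* properties are
    -- illustrative, not taken from the thread.
    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;

    -- Unverified guess: the job-level output properties may be the ones the
    -- SequenceFile writer consults, so setting them as well should not hurt.
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Hypothetical target table stored as a SequenceFile.
    CREATE TABLE weblogs_seq (line STRING)
    STORED AS SEQUENCEFILE;

    -- The cleanup-and-insert step mentioned above, reduced to a pass-through.
    INSERT OVERWRITE TABLE weblogs_seq
    SELECT line
    FROM weblogs_raw;

Comparing the size of the files written by a script like this with and
without the mapred.output.* lines would show whether block compression is
actually being applied, which is the check Zheng suggests above.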
