Hi Zheng, I cross-checked. I am setting the following in my Hive script before the INSERT command:
SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;

A 132 MB (gzipped) input file, after going through a cleanup and being
inserted into a SequenceFile table, grows to 432 MB. What could be going
wrong? (A fuller sketch of the script is included after the quoted thread
below.)

Saurabh.

On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <[email protected]> wrote:

> Thanks, Zheng. Will do some more tests and get back.
>
> Saurabh.
>
> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <[email protected]> wrote:
>
>> I would first check whether it is really block compression or record
>> compression.
>> Also, maybe the block size is too small, but I am not sure whether that
>> is tunable in SequenceFile or not.
>>
>> Zheng
>>
>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <[email protected]> wrote:
>> > Hi,
>> >
>> > The size of my gzipped weblog files is about 35 MB. However, upon
>> > enabling block compression and inserting the logs into another Hive
>> > table (SequenceFile), the file size bloats up to about 233 MB. I've
>> > done similar processing on a local Hadoop/Hive cluster, and while the
>> > compression is not as good as gzipping, it still is not this bad.
>> > What could be going wrong?
>> >
>> > I looked at the header of the resulting file and here's what it says:
>> >
>> > SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>> >
>> > Does Amazon Elastic MapReduce behave differently or am I doing
>> > something wrong?
>> >
>> > Saurabh.
>> > --
>> > http://nandz.blogspot.com
>> > http://foodieforlife.blogspot.com
>>
>> --
>> Yours,
>> Zheng
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
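For reference, here is a minimal sketch of the kind of script being
described, assuming a plain-text staging table and a SequenceFile target.
The table names (weblogs_raw, weblogs_seq) and the job-level mapred.output.*
properties are illustrative additions, not settings confirmed anywhere in
this thread; the thought is only that the MapReduce job properties, rather
than io.seqfile.compression.type on its own, may be what actually decide the
compression type of the SequenceFiles that get written.

    -- Minimal sketch; table names and the mapred.* properties are
    -- illustrative, not taken from the thread.
    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;

    -- Unverified guess: the job-level output properties may be the ones the
    -- SequenceFile writer consults, so setting them as well should not hurt.
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- Hypothetical target table stored as a SequenceFile.
    CREATE TABLE weblogs_seq (line STRING)
    STORED AS SEQUENCEFILE;

    -- The cleanup-and-insert step mentioned above, reduced to a pass-through.
    INSERT OVERWRITE TABLE weblogs_seq
    SELECT line
    FROM weblogs_raw;

Comparing the size of the files written by a script like this with and
without the mapred.output.* lines would show whether block compression is
actually being applied, which is the check Zheng suggests above.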
