I would first check whether the file is really block-compressed or only
record-compressed.
Also, maybe the compression block size is too small, though I am not sure
whether that is tunable in SequenceFile.
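
If I am reading the header you pasted below correctly, the two bytes after
the value class name (^A^@) look like the "compressed" and "block
compressed" flags, i.e. true/false, which would mean the file is only
record-compressed. A quick way to confirm is to open one of the files with
the plain SequenceFile API -- something like the sketch below (untested,
and the path is just a placeholder):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;

  public class CheckSeqFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // args[0] = path to one of the table's files (placeholder)
      Path file = new Path(args[0]);
      FileSystem fs = file.getFileSystem(conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
      System.out.println("compressed:       " + reader.isCompressed());
      System.out.println("block compressed: " + reader.isBlockCompressed());
      System.out.println("codec:            " + reader.getCompressionCodec());
      reader.close();
    }
  }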

Zheng

On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <[email protected]> wrote:
> Hi,
>
> The size of my Gzipped weblog files is about 35MB. However, upon enabling
> block compression, and inserting the logs into another Hive table
> (sequencefile), the file size bloats up to about 233MB. I've done similar
> processing on a local Hadoop/Hive cluster, and while the compression is not
> as good as gzipping, it still is not this bad. What could be going wrong?
>
> I looked at the header of the resulting file and here's what it says:
>
> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>
> Does Amazon Elastic MapReduce behave differently or am I doing something
> wrong?
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>
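
By the way, if it does turn out to be record compression, I think the knobs
on the Hive side are hive.exec.compress.output=true and
mapred.output.compression.type=BLOCK (set before the INSERT), and the block
size itself should be io.seqfile.compress.blocksize (default around 1MB) --
but I have not tried this on Elastic MapReduce, so treat the following only
as a rough sketch of writing a block-compressed SequenceFile directly, with
the block size set explicitly:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.GzipCodec;

  public class WriteBlockCompressed {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Compression block size for block-compressed SequenceFiles
      // (default is about 1MB; larger blocks usually compress better).
      conf.setInt("io.seqfile.compress.blocksize", 1000000);
      Path out = new Path(args[0]);  // placeholder output path
      FileSystem fs = out.getFileSystem(conf);
      GzipCodec codec = new GzipCodec();
      codec.setConf(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, BytesWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK, codec);
      writer.append(new BytesWritable(), new Text("example log line"));
      writer.close();
    }
  }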



-- 
Yours,
Zheng
