Hi,

The size of my gzipped weblog files is about 35MB. However, after enabling block compression and inserting the logs into another Hive table (stored as SequenceFile), the file size bloats to about 233MB. I've done similar processing on a local Hadoop/Hive cluster, and while the compression there is not as good as plain gzipping, it is nowhere near this bad. What could be going wrong?
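For reference, here is roughly what I'm running (the table names weblogs_raw and weblogs_seq are placeholders for my actual tables):

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- load the raw logs into the SequenceFile-backed table
    INSERT OVERWRITE TABLE weblogs_seq
    SELECT * FROM weblogs_raw;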
I looked at the header of the resulting file and here's what it says:

    SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec

(If I'm decoding that right, the ^A^@ after the value class means the record-compression flag is set but the block-compression flag is not, so the writer may not be block-compressing at all.)

Does Amazon Elastic MapReduce behave differently, or am I doing something wrong?

Saurabh.

--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
