hive.exec.compress.output controls whether to compress Hive output. (This overrides mapred.output.compress in Hive.)
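For concreteness, a minimal sketch of the two switches at the Hive CLI (exact defaults and precedence vary by Hive/Hadoop version):

    -- Hive-level switch: when true, Hive compresses the job's final output
    SET hive.exec.compress.output=true;

    -- Hadoop-level flag that the Hive setting overrides
    SET mapred.output.compress=true;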
All other compression flags come from Hadoop. Please see
http://hadoop.apache.org/common/docs/r0.18.0/hadoop-default.html

Zheng

On Fri, Feb 19, 2010 at 5:53 AM, Saurabh Nanda <[email protected]> wrote:
> And also hive.exec.compress.*. So that makes it three sets of configuration
> variables:
>
> mapred.output.compress.*
> io.seqfile.compress.*
> hive.exec.compress.*
>
> What's the relationship between these configuration parameters, and which
> ones should I set to achieve a well-compressed output table?
>
> Saurabh.
>
> On Fri, Feb 19, 2010 at 7:16 PM, Saurabh Nanda <[email protected]> wrote:
>>
>> I'm confused here, Zheng. There are two sets of configuration variables:
>> those starting with io.* and those starting with mapred.*. To make sure
>> that the final output table is compressed, which ones do I have to set?
>>
>> Saurabh.
>>
>> On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao <[email protected]> wrote:
>>>
>>> Did you also:
>>>
>>> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>
>>> Zheng
>>>
>>> On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda <[email protected]> wrote:
>>>>
>>>> Hi Zheng,
>>>>
>>>> I cross-checked. I am setting the following in my Hive script before the
>>>> INSERT command:
>>>>
>>>> SET io.seqfile.compression.type=BLOCK;
>>>> SET hive.exec.compress.output=true;
>>>>
>>>> A 132 MB (gzipped) input file, going through a cleanup and getting
>>>> populated into a sequencefile table, is growing to 432 MB. What could be
>>>> going wrong?
>>>>
>>>> Saurabh.
>>>>
>>>> On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <[email protected]> wrote:
>>>>>
>>>>> Thanks, Zheng. Will do some more tests and get back.
>>>>>
>>>>> Saurabh.
>>>>>
>>>>> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <[email protected]> wrote:
>>>>>>
>>>>>> I would first check whether it is really block compression or record
>>>>>> compression. Also, maybe the block size is too small, but I am not
>>>>>> sure whether that is tunable in SequenceFile.
>>>>>>
>>>>>> Zheng
>>>>>>
>>>>>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The size of my gzipped weblog files is about 35 MB. However, upon
>>>>>>> enabling block compression and inserting the logs into another Hive
>>>>>>> table (sequencefile), the file size bloats up to about 233 MB. I've
>>>>>>> done similar processing on a local Hadoop/Hive cluster, and while the
>>>>>>> compression is not as good as gzipping, it is still not this bad.
>>>>>>> What could be going wrong?
>>>>>>>
>>>>>>> I looked at the header of the resulting file, and here's what it says:
>>>>>>>
>>>>>>> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>>>>>>>
>>>>>>> Does Amazon Elastic MapReduce behave differently, or am I doing
>>>>>>> something wrong?
>>>>>>>
>>>>>>> Saurabh.
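Decoding the header quoted above, assuming the standard SequenceFile layout (magic and version byte, length-prefixed key and value class names, two single-byte boolean flags, then the codec class name; this annotation is an interpretation, not part of the original thread):

    SEQ^F                                       magic + format version 6
    "org.apache.hadoop.io.BytesWritable         key class (length-prefixed)
    ^Yorg.apache.hadoop.io.Text                 value class (length-prefixed)
    ^A                                          compressed      = true  (0x01)
    ^@                                          blockCompressed = false (0x00)
    'org.apache.hadoop.io.compress.GzipCodec    codec class (length-prefixed)

If that reading is right, the file is record-compressed rather than block-compressed, which is exactly the distinction Zheng suggests checking: gzipping each short log record separately compresses far worse than gzipping whole blocks, and would account for the bloat.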
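Taken together, the settings discussed in this thread for a block-compressed, gzipped sequencefile table would look like this (a sketch only; the table names are hypothetical, and whether io.seqfile.compression.type or the mapred.* properties takes effect can vary by Hive/Hadoop version):

    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    -- hypothetical table names, for illustration only
    INSERT OVERWRITE TABLE weblogs_seq
    SELECT * FROM weblogs_raw;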
