Hi,

On Wed, Feb 29, 2012 at 19:13, Robert Evans <ev...@yahoo-inc.com> wrote:
> What I really want to know is how well does this new CompressionCodec
> perform in comparison to the regular gzip codec in various different
> conditions and what type of impact does it have on network traffic and
> datanode load. My gut feeling is that the speedup is going to be
> relatively small except when there is a lot of computation happening in
> the mapper

I agree; I made the same assessment. In the javadoc I wrote under "When is
this useful?": *"Assume you have a heavy map phase for which the input is a
1GiB Apache httpd logfile. Now assume this map takes 60 minutes of CPU time
to run."*

> and the added load and network traffic outweighs the speedup in most
> cases,

No, the trick to solve that one is to upload the gzipped files with an HDFS
block size equal to (or 1 byte larger than) the file size. This setting
helps speed up gzipped input files in any situation (no more network
overhead). From there, the HDFS replication factor of the file dictates the
optimal number of splits for this codec. (A sketch of this upload trick is
included below.)

> but like all performance on a complex system gut feelings are almost
> worthless and hard numbers are what is needed to make a judgment call.

Yes.

> Niels, I assume you have tested this on your cluster(s). Can you share
> with us some of the numbers?

No, I haven't tested it beyond a multi-core system. The simple reason is
that when this was under review last summer the whole "Yarn" thing happened
and I was unable to run it at all for a long time. I only got it running
again last December, when the restructuring of the source tree was mostly
done.

At this moment I'm building an experimentation setup at work that can be
used for various things. Given the current state of Hadoop 2.0 I think it's
time to produce some actual results.

--
Best regards / Met vriendelijke groeten,

Niels Basjes
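P.S. A minimal sketch of the single-block upload trick mentioned above,
using the standard Hadoop FileSystem API. The local file name, target path,
buffer size and replication factor are just example values; the block size
is rounded up to a multiple of the 512-byte checksum chunk so HDFS accepts
it.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class UploadGzipAsSingleBlock {
      public static void main(String[] args) throws Exception {
        File local = new File("access_log.gz");   // example local gzip file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The block size must be at least the file size so the whole gzip
        // file lands in a single block (no network traffic per split).
        // HDFS also requires the block size to be a multiple of the
        // checksum chunk size (512 bytes by default), so round up.
        long blockSize = ((local.length() / 512) + 1) * 512;

        // With this codec, the replication factor of the file bounds the
        // number of splits that can be read from a local replica.
        short replication = 3;                     // example value

        try (OutputStream out = fs.create(new Path("/logs/access_log.gz"),
                                          true, 64 * 1024, replication,
                                          blockSize);
             InputStream in = new FileInputStream(local)) {
          IOUtils.copyBytes(in, out, 64 * 1024);
        }
      }
    }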