I do agree that a GitHub project is the way to go, unless you could convince 
Cloudera, Hortonworks or MapR to pick it up and support it. They have enough 
committers.

Is this potentially worthwhile? Maybe; it depends on how the cluster is 
integrated into the overall environment. Companies that have standardized on 
using gzip would find it useful.



Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 29, 2012, at 3:17 PM, Niels Basjes <ni...@basjes.nl> wrote:

> Hi,
> 
> On Wed, Feb 29, 2012 at 19:13, Robert Evans <ev...@yahoo-inc.com> wrote:
> 
> 
>> What I really want to know is how well does this new CompressionCodec
>> perform in comparison to the regular gzip codec in various different
>> conditions and what type of impact does it have on network traffic and
>> datanode load.  My gut feeling is that the speedup is going to be
>> relatively small except when there is a lot of computation happening in
>> the mapper
> 
> 
> I agree; I made the same assessment.
> In the javadoc, under "When is this useful?", I wrote:
> *"Assume you have a heavy map phase for which the input is a 1GiB Apache
> httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*
> 
> 
>> and the added load and network traffic outweighs the speedup in most
>> cases,
> 
> 
> No, the trick to solve that one is to upload the gzipped files with an HDFS
> block size equal to (or one byte larger than) the file size.
> This setting speeds up gzipped input files in any situation, because the
> whole file then fits in a single block: any node holding a replica can read
> the complete file locally, so there is no more network overhead.
> From there, the file's HDFS replication factor dictates the optimal number
> of splits for this codec.
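> 
> To make that concrete, here is a minimal sketch of what I mean, using the
> plain FileSystem API. The class name, the argument handling and the 512-byte
> rounding are just mine for illustration (HDFS wants the block size to be a
> multiple of the checksum chunk size, which is 512 bytes by default):
> 
>   import java.io.InputStream;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IOUtils;
> 
>   public class UploadGzipAsSingleBlock {
>     public static void main(String[] args) throws Exception {
>       Path localFile = new Path(args[0]);            // e.g. access_log.gz
>       Path hdfsFile  = new Path(args[1]);            // target path in HDFS
>       short replication = Short.parseShort(args[2]); // max number of "local" splits
> 
>       Configuration conf = new Configuration();
>       FileSystem localFs = FileSystem.getLocal(conf);
>       FileSystem hdfs    = FileSystem.get(conf);
> 
>       // One block for the whole file: round the length up to a multiple of 512.
>       long fileLen   = localFs.getFileStatus(localFile).getLen();
>       long blockSize = ((fileLen / 512) + 1) * 512;
> 
>       // create(path, overwrite, bufferSize, replication, blockSize)
>       InputStream in = localFs.open(localFile);
>       IOUtils.copyBytes(in,
>           hdfs.create(hdfsFile, true, 64 * 1024, replication, blockSize),
>           conf, true);
>     }
>   }
> 
> Something like "hadoop fs -D dfs.block.size=<rounded-up filesize> -put ..."
> should give the same effect from the command line; I just find the code
> easier to read.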
> 
> 
>> but like all performance on a complex system gut feelings are almost
>> worthless and hard numbers are what is needed to make a judgment call.
> 
> 
> Yes
> 
> 
>> Niels, I assume you have tested this on your cluster(s).  Can you share
>> with us some of the numbers?
>> 
> 
> No, I haven't tested it beyond a multi-core system.
> The simple reason is that when this was under review last summer, the whole
> "YARN" transition happened and I was unable to run it at all for a long time.
> I only got it running again last December, when the restructuring of the
> source tree was mostly done.
> 
> At this moment I'm building an experimentation setup at work that can be
> used for various things.
> Given the current state of Hadoop 2.0, I think it's time to produce some
> actual results.
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes
