Hi,

On Wed, Feb 29, 2012 at 19:13, Robert Evans <ev...@yahoo-inc.com> wrote:
> What I really want to know is how well does this new CompressionCodec
> perform in comparison to the regular gzip codec in various different
> conditions and what type of impact does it have on network traffic and
> datanode load. My gut feeling is that the speedup is going to be
> relatively small except when there is a lot of computation happening in
> the mapper

I agree; I made the same assessment. In the javadoc I wrote under "When is
this useful?": *"Assume you have a heavy map phase for which the input is a
1GiB Apache httpd logfile. Now assume this map takes 60 minutes of CPU time
to run."*

> and the added load and network traffic outweighs the speedup in most
> cases,

No, the trick to solve that one is to upload the gzipped files with an HDFS
block size equal to (or 1 byte larger than) the file size. This setting
helps speed up gzipped input files in any situation (no more network
overhead). From there, the HDFS replication factor of the file dictates the
optimal number of splits for this codec. (A sketch of this upload trick is
included below.)

> but like all performance on a complex system gut feelings are almost
> worthless and hard numbers are what is needed to make a judgment call.

Yes.

> Niels, I assume you have tested this on your cluster(s). Can you share
> with us some of the numbers?

No, I haven't tested it beyond a multi-core system. The simple reason is
that when this was under review last summer the whole "Yarn" thing happened
and I was unable to run it at all for a long time. I only got it running
again last December, when the restructuring of the source tree was mostly
done.

At this moment I'm building an experimentation setup at work that can be
used for various things. Given the current state of Hadoop 2.0 I think it's
time to produce some actual results.

--
Best regards / Met vriendelijke groeten,

Niels Basjes
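P.S. A minimal sketch of the single-block upload trick mentioned above,
using the standard Hadoop FileSystem API. The local file name, target path,
buffer size and replication factor are just example values; the block size
is rounded up to a multiple of the 512-byte checksum chunk so HDFS accepts
it.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class UploadGzipAsSingleBlock {
      public static void main(String[] args) throws Exception {
        File local = new File("access_log.gz");   // example local gzip file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The block size must be at least the file size so the whole gzip
        // file lands in a single block (no network traffic per split).
        // HDFS also requires the block size to be a multiple of the
        // checksum chunk size (512 bytes by default), so round up.
        long blockSize = ((local.length() / 512) + 1) * 512;

        // With this codec, the replication factor of the file bounds the
        // number of splits that can be read from a local replica.
        short replication = 3;                     // example value

        try (OutputStream out = fs.create(new Path("/logs/access_log.gz"),
                                          true, 64 * 1024, replication,
                                          blockSize);
             InputStream in = new FileInputStream(local)) {
          IOUtils.copyBytes(in, out, 64 * 1024);
        }
      }
    }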