Hi Hadoop gurus, here's a question about intermediate compression.

As I understand it, the point of compressing map output is to reduce the 
network traffic that occurs when intermediate data is shuffled from map tasks 
to reduce tasks that do not reside on the same boxes. So, depending on various 
factors such as how the cluster is set up, the data size, the nature of the 
problem being solved, the quality of the m/r program (e.g. a Pig script), etc., 
this reduction in network traffic may or may not compensate for the time spent 
compressing and decompressing. In other words, intermediate compression may 
fail to reach its goal of reducing the overall run time of an m/r job.
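To make that trade-off concrete, here is a back-of-envelope sketch. All of the numbers (shuffle bandwidth, compression ratio, codec throughput) are hypothetical placeholders, not measurements; the point is only the shape of the comparison, not the outcome.

```python
# Back-of-envelope comparison of shuffle time with and without
# intermediate compression. All figures below are made-up examples.
map_output_mb = 10 * 1024     # total map output to shuffle (MB)
network_mb_s = 100.0          # effective shuffle bandwidth per node (MB/s)
ratio = 0.25                  # compressed size / raw size
compress_mb_s = 200.0         # compression throughput (MB/s of raw input)
decompress_mb_s = 300.0       # decompression throughput (MB/s of raw output)

# Uncompressed: just move the raw bytes over the network.
raw_shuffle_s = map_output_mb / network_mb_s

# Compressed: move fewer bytes, but pay CPU time on both sides.
compressed_shuffle_s = (map_output_mb * ratio / network_mb_s
                        + map_output_mb / compress_mb_s
                        + map_output_mb * ratio / decompress_mb_s)

print("raw:        %.1f s" % raw_shuffle_s)
print("compressed: %.1f s" % compressed_shuffle_s)
```

With these particular numbers compression wins, but make the network faster or the codec slower and the sign flips, which is exactly why I'm asking for real test data.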

I know of one blog post 
(http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) 
that gives compression/decompression ratios and speeds and reports a positive 
result from compressing the raw input data to an m/r job, but it offers no 
tests or insight into intermediate compression. So I am wondering whether 
there are any case studies or test results that give guidance on when to use 
intermediate compression: pros and cons, settings, pitfalls, and gains...
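For reference, the knobs I mean are the map-output compression properties (names as I understand them from the Hadoop configuration of this era; the job name and paths below are placeholders, and the LZO codec from the blog post would need the separate hadoop-lzo libraries installed):

```shell
# Enable compression of intermediate map output for one job,
# assuming the job's main class uses ToolRunner so -D is honored.
hadoop jar myjob.jar MyJob \
  -D mapred.compress.map.output=true \
  -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  /user/michael/input /user/michael/output
```

The same two properties can be set cluster-wide in hadoop-site.xml instead, which is part of what I'd like guidance on: per-job versus cluster-wide defaults.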

Thanks,

Michael
