Hi Hadoop gurus, here's a question about intermediate compression. As I understand it, the point of compressing map output is to reduce the network traffic that occurs when intermediate data is shuffled from map tasks to reduce tasks that do not run on the same boxes. So, depending on various factors (how the cluster is set up, the data size, the nature of the problem, the quality of the MapReduce program, e.g. a Pig script, etc.), the reduction in network traffic from compressed data may or may not compensate for the time spent compressing and decompressing. In other words, intermediate compression may fail to achieve its goal of reducing the overall runtime of a MapReduce job.
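For context, here is a minimal sketch of the settings I mean. The property names are the older mapred.* ones used by 0.20-era Hadoop releases, and the LZO codec class assumes the hadoop-lzo library is installed; substitute whatever codec your cluster actually has:

```xml
<!-- mapred-site.xml (or a per-job configuration) -->
<!-- turn on compression of intermediate (map output) data -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<!-- which codec to use; LzoCodec assumes the hadoop-lzo library -->
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

The same thing can be set per job from Java via JobConf.setCompressMapOutput(true) and JobConf.setMapOutputCompressorClass(...), which makes it easy to run the same job with and without intermediate compression and compare.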
As far as I know, a blog post (http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html) gives compression ratios and speeds and reports a positive result from compressing the raw input data to a MapReduce job, but offers no tests or insight into intermediate compression. So I am wondering whether there are any case studies or test results that indicate when to use intermediate compression: pros and cons, settings, pitfalls, and gains... Thanks, Michael
