I was referred here by Alan Gates (I'm a committer on the Pig project). I've been working on the intermediate serialization of Pig objects. When serializing, there is generally a tradeoff between serialization time and space on disk (an extreme example being compression vs. no compression; a more nuanced one being varints vs. full ints, that sort of thing). With Hadoop, network I/O is generally the bottleneck, but I'm not sure of the best way to evaluate something like: method X takes 3x as long to serialize, but is potentially half as large on disk.
Does anyone have a good framework for evaluating (or at least benchmarking) the tradeoff between CPU efficiency and shuffle/sort volume?
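To make the kind of comparison concrete, here is a rough, self-contained micro-benchmark sketch (the varint writer and the synthetic data are purely illustrative, not Pig's actual serialization code) that measures serialize time and byte size for fixed-width ints vs. varints:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Random;

    // Illustrative only: compares fixed 4-byte ints vs. a hand-rolled
    // varint encoding on synthetic data, reporting time and output size.
    public class SerializationTradeoffBench {

        // Simple LEB128-style varint for non-negative ints, for illustration.
        static void writeVarInt(DataOutputStream out, int v) throws IOException {
            while ((v & ~0x7F) != 0) {
                out.writeByte((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.writeByte(v);
        }

        public static void main(String[] args) throws IOException {
            int n = 5_000_000;
            int[] data = new int[n];
            Random rnd = new Random(42);
            for (int i = 0; i < n; i++) {
                data[i] = rnd.nextInt(1 << 16);  // mostly small values, which favours varints
            }

            // Fixed-width ints
            ByteArrayOutputStream fixedBuf = new ByteArrayOutputStream();
            DataOutputStream fixedOut = new DataOutputStream(fixedBuf);
            long t0 = System.nanoTime();
            for (int v : data) fixedOut.writeInt(v);
            fixedOut.flush();
            long fixedNanos = System.nanoTime() - t0;

            // Varints
            ByteArrayOutputStream varBuf = new ByteArrayOutputStream();
            DataOutputStream varOut = new DataOutputStream(varBuf);
            long t1 = System.nanoTime();
            for (int v : data) writeVarInt(varOut, v);
            varOut.flush();
            long varNanos = System.nanoTime() - t1;

            System.out.printf("fixed : %d bytes, %.1f ms%n", fixedBuf.size(), fixedNanos / 1e6);
            System.out.printf("varint: %d bytes, %.1f ms%n", varBuf.size(), varNanos / 1e6);
        }
    }

That tells me CPU time and on-disk size in isolation, but not how to weigh one against the other once the data is flowing through the shuffle, which is really what I'm asking about.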