Hi, I was looking at how the Combiner gets run inside Hadoop. Since there's no separate Combiner interface, it basically seems to give the user the impression any valid Reducer can be used as a Combiner. There's a little "warning" note added to the docs under the setCombinerClass() method as to what a combiner should not do:
"The framework may invoke the combiner 0, 1, or multiple times, in both the mapper and reducer tasks. In general, the combiner is called as the sort/merge result is written to disk. The combiner must: * be side-effect free * have the same input and output key types and the same input and output value types Typically the combiner is same as the Reducer for the job i.e. setReducerClass(Class)." But after reading through some of the Hadoop code, it seems like there are many more restrictions on what a valid combiner is. Correct me if I'm wrong but a combiner can potentially get run 3 different times: 1) Map-side - when a map-side spill is about to occur, combiner is run inside each partition. 2) Map-side - when multiple spill files are being merged together. 3) Reduce-side - when multiple map-outputs are being merged before the reduce() function is invoked. In all these above cases, Hadoop is making an assumption that: 1) A combiner does not change the key. Apart from the fact that the input and output key types should match (as already mentioned in the Java docs above), it seems like the input and the output key should be exactly the same. Because if the combiner modifies the key, the key basically ends up in the wrong partition (because this new key is never rehashed to another partition) and the entire merge-sort operation is affected downstream. 2) Since the combiner can be run multiple times or not be run at all, it directly implies that the operation that is used to combine multiple values into one value should be "associative & commutative"? Why isn't there a separate Combiner interface? If not, maybe some of these other gotchas should be documented somewhere? I ran a simple experiment - I took the wordcount example and just modified the Reducer such that it outputs the string key "reversed" and then used this reducer as the combiner. It led to some funny results. -- Harish Mallipeddi http://blog.poundbang.in
