Hi,

I believe a combiner is currently not run unless at least one reducer is set. Leaving aside the Hadoop-18 semantics of the combiner running on both sides (the number of reducers is 0 anyway, so I guess the merge-combine doesn't come into the picture at all), I have a use case where I would like to run a combiner without a reducer.

Basically, the aggregation I do (a lookup sort of thing) depends on a relatively small dataset and does not depend on the records in the map input that make up the input dataset, hence the motivation for combine-without-reduce. What I want to do is aggregate similar records in the combiner (or in a particular instance of the combiner) in a single shot, with that forming my output. This would save the intermediate I/O involved in the shuffle and sort (S&S) phase, at some partial I/O cost on the map + combine side, and I just want to try it out to see if it's feasible at all.

Given that a combiner without a reducer is not supported, I was thinking of doing it the way Hadoop itself would: create a buffer, sort it, and combine as I flush. Any thoughts on this would be really helpful.
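To make the buffer / sort / combine-on-flush idea concrete, here is a rough sketch in plain Java with no Hadoop classes. The class name `CombiningBuffer`, the record-count threshold, the `String`/`long` types, and the use of a sum as the aggregation are all placeholders I made up for illustration, not anything Hadoop provides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical helper: buffers (key, value) pairs, then sorts and combines them on flush. */
class CombiningBuffer {
    private static final class Pair {
        final String key;
        final long value;
        Pair(String key, long value) { this.key = key; this.value = value; }
    }

    private final int capacity;                    // assumed flush threshold, in records
    private final BiConsumer<String, Long> sink;   // stands in for the job's output collector
    private final List<Pair> buffer = new ArrayList<>();

    CombiningBuffer(int capacity, BiConsumer<String, Long> sink) {
        this.capacity = capacity;
        this.sink = sink;
    }

    /** Buffer one record; flush once the buffer reaches capacity. */
    void add(String key, long value) {
        buffer.add(new Pair(key, value));
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    /** Sort by key, combine each run of equal keys (summing here), and emit one record per key. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        buffer.sort(Comparator.comparing((Pair p) -> p.key));
        String current = buffer.get(0).key;
        long sum = 0;
        for (Pair p : buffer) {
            if (!p.key.equals(current)) {
                sink.accept(current, sum);   // finished one run of equal keys
                current = p.key;
                sum = 0;
            }
            sum += p.value;
        }
        sink.accept(current, sum);           // emit the last run
        buffer.clear();
    }
}
```

The intent would be to call `add()` from `map()` and `flush()` from the mapper's end-of-task hook (`close()` or `cleanup()`, depending on the API), with the number of reducers set to 0. One caveat: a key that spans two flushes is emitted twice, so the output is only fully aggregated if the buffer can hold everything a map task produces for that key.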
Thanks, Amogh