John M Cieslewicz wrote:
> A summary of desirable semantics:
>
> 1. The map function produces as output partial aggregate values
>    representing singletons.
> 2. A new combiner function that explicitly performs partial-to-partial
>    aggregation over one or more values, creating one new output value of
>    the same type as the input values without changing the key.
> 3. A reducer which takes as input partial aggregates and produces final
>    values of any format.
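To make the three roles concrete, here is a minimal standalone sketch (plain Java with illustrative names, not the actual Hadoop API) for computing an average, where the partial aggregate is a (sum, count) pair:

```java
import java.util.Arrays;
import java.util.List;

public class AverageCombine {
    // Partial aggregate: a (sum, count) pair.
    static class Partial {
        final long sum;
        final long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // 1. Map: each raw value becomes a singleton partial aggregate.
    static Partial map(long value) { return new Partial(value, 1); }

    // 2. Combine: partial-to-partial; output has the same type as the inputs.
    static Partial combine(List<Partial> partials) {
        long sum = 0, count = 0;
        for (Partial p : partials) { sum += p.sum; count += p.count; }
        return new Partial(sum, count);
    }

    // 3. Reduce: partials to a final value, which may have a different type.
    static double reduce(List<Partial> partials) {
        Partial total = combine(partials);
        return (double) total.sum / total.count;
    }

    public static void main(String[] args) {
        List<Partial> singletons = Arrays.asList(map(1), map(2), map(3), map(4));
        System.out.println(reduce(singletons)); // prints 2.5
    }
}
```

Note that an average is not directly combinable, which is why the partial aggregate must carry the count alongside the sum.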
This seems consistent with all uses of combiners I have seen. It seems
we could start by simply adding a few tests:
1. That combiners do not alter keys;
2. That combiners always output a single value.
Then we can subsequently add optimizations that take advantage of these
guarantees.
If we decide to go this direction, it might be worth adding these
tests sooner rather than later.
Alternately, we could add a new Combiner interface that enforces these:
public interface Combiner {
  /** Combine all values passed into a single value that is returned. */
  public Writable combine(WritableComparable key, Iterator values);
}
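For illustration, a summing combiner against such an interface might look like the following. The Writable and WritableComparable declarations here are simplified stand-ins so the sketch compiles on its own; they are not the real Hadoop classes, and IntValue is an invented placeholder type:

```java
import java.util.Iterator;

// Simplified stand-ins for Hadoop's types, just so the sketch compiles.
interface Writable {}
interface WritableComparable extends Writable {}

// Placeholder value type (not Hadoop's IntWritable).
class IntValue implements WritableComparable {
    final int value;
    IntValue(int value) { this.value = value; }
}

// A combiner obeying the proposed contract: it reads any number of
// values, returns exactly one value of the same type, and leaves the
// key untouched.
public class IntSumCombiner {
    public Writable combine(WritableComparable key, Iterator values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += ((IntValue) values.next()).value;
        }
        return new IntValue(sum);
    }
}
```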
We should also make it clear to programmers that values may be
combined more than once between map and reduce. Besides documentation,
a good way to make folks aware of that is to implement it.
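For a sum, for instance, re-combining already-combined partials yields the same final answer, which is exactly the property that lets the framework apply a combiner any number of times (a standalone sketch with illustrative names):

```java
import java.util.Arrays;
import java.util.List;

public class RepeatedCombine {
    // Combine any number of partial sums into one partial sum.
    static long combine(List<Long> partials) {
        long sum = 0;
        for (long p : partials) sum += p;
        return sum;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L, 4L);

        // Combined once, in a single pass.
        long once = combine(values);

        // Combined twice: partial results are themselves combined again,
        // as may happen between map and reduce.
        long twice = combine(Arrays.asList(
            combine(values.subList(0, 2)),
            combine(values.subList(2, 4))));

        System.out.println(once == twice); // prints true
    }
}
```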
> I am currently working to get some performance numbers related to pushing
> aggregation into the copying and merging phase of reduce.
Please share your results when you have them!
Thanks,
Doug