John M Cieslewicz wrote:
> A summary of desirable semantics:
>
> 1. The map function produces as output partial aggregate values
>    representing singletons.
> 2. A new combiner function that explicitly performs partial-to-partial
>    aggregation over one or more values, creating one new output value of
>    the same type as the input values without changing the key.
> 3. A reducer which takes as input partial aggregates and produces final
>    values of any format.
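To make the three roles concrete, here is a minimal standalone sketch (plain Java with illustrative names, not the actual Hadoop API) for computing an average, where the partial aggregate is a (sum, count) pair:

```java
import java.util.Arrays;
import java.util.List;

public class AverageCombine {
    // Partial aggregate: a (sum, count) pair.
    static class Partial {
        final long sum;
        final long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // 1. Map: each raw value becomes a singleton partial aggregate.
    static Partial map(long value) { return new Partial(value, 1); }

    // 2. Combine: partial-to-partial; output has the same type as the inputs.
    static Partial combine(List<Partial> partials) {
        long sum = 0, count = 0;
        for (Partial p : partials) { sum += p.sum; count += p.count; }
        return new Partial(sum, count);
    }

    // 3. Reduce: partials to a final value, which may have a different type.
    static double reduce(List<Partial> partials) {
        Partial total = combine(partials);
        return (double) total.sum / total.count;
    }

    public static void main(String[] args) {
        List<Partial> singletons = Arrays.asList(map(1), map(2), map(3), map(4));
        System.out.println(reduce(singletons)); // prints 2.5
    }
}
```

Note that an average is not directly combinable, which is why the partial aggregate must carry the count alongside the sum.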
This seems consistent with all uses of combiners I have seen. It seems
we could start by simply adding a few tests:
1. That combiners do not alter keys;
2. That combiners always output a single value.
Then we can subsequently add optimizations that take advantage of these
guarantees.
If we decide to go this direction, it might be worth adding these
tests sooner rather than later.
Alternately, we could add a new Combiner interface that enforces these:
public interface Combiner {
  /** Combine all values passed into a single value that is returned. */
  public Writable combine(WritableComparable key, Iterator values);
}
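For illustration, a summing combiner against such an interface might look like the following. The Writable and WritableComparable declarations here are simplified stand-ins so the sketch compiles on its own; they are not the real Hadoop classes, and IntValue is an invented placeholder type:

```java
import java.util.Iterator;

// Simplified stand-ins for Hadoop's types, just so the sketch compiles.
interface Writable {}
interface WritableComparable extends Writable {}

// Placeholder value type (not Hadoop's IntWritable).
class IntValue implements WritableComparable {
    final int value;
    IntValue(int value) { this.value = value; }
}

// A combiner obeying the proposed contract: it reads any number of
// values, returns exactly one value of the same type, and leaves the
// key untouched.
public class IntSumCombiner {
    public Writable combine(WritableComparable key, Iterator values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += ((IntValue) values.next()).value;
        }
        return new IntValue(sum);
    }
}
```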
We should also make it clear to programmers that values may be
combined more than once between map and reduce. Besides documentation,
a good way to make folks aware of that is to implement it.
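For a sum, for instance, re-combining already-combined partials yields the same final answer, which is exactly the property that lets the framework apply a combiner any number of times (a standalone sketch with illustrative names):

```java
import java.util.Arrays;
import java.util.List;

public class RepeatedCombine {
    // Combine any number of partial sums into one partial sum.
    static long combine(List<Long> partials) {
        long sum = 0;
        for (long p : partials) sum += p;
        return sum;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L, 4L);

        // Combined once, in a single pass.
        long once = combine(values);

        // Combined twice: partial results are themselves combined again,
        // as may happen between map and reduce.
        long twice = combine(Arrays.asList(
            combine(values.subList(0, 2)),
            combine(values.subList(2, 4))));

        System.out.println(once == twice); // prints true
    }
}
```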
> I am currently working to get some performance numbers related to pushing
> aggregation into the copying and merging phase of reduce.
Please share your results when you have them!
Thanks,
Doug