On Aug 6, 2007, at 10:12 PM, novice user wrote:

In the reduce phase, with outputValueGroupingComparator, we can sort all keys, group the values of a particular key together, and send them to the reduce() method. Is there a way to sort the values of a particular key efficiently before they reach the reduce method?

There are two comparators that are used for sorting for precisely this purpose. In particular:

JobConf.getOutputKeyComparator()
JobConf.getOutputValueGroupingComparator()

The first controls the sort order, and the second controls which keys are grouped into a single call to reduce.

Therefore, if your data has primary key K1 and secondary key K2:

class MyKey implements WritableComparable {
  K1 primary;
  K2 secondary;
  ...
}

you make the map output key MyKey, and the OutputKeyComparator uses both primary and secondary to determine the order. The OutputValueGroupingComparator would compare only the primary keys for equality. So if your data looked like:

K1(1), K2(1), V1
K1(1), K2(2), V2
K1(2), K2(1), V3
K1(2), K2(2), V4

the records would be sorted as above, but the reduce would see two calls: one for K1(1) with values V1 and V2, and one for K1(2) with values V3 and V4. Within each call, the values arrive in K2 order.
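
To make the mechanics concrete, here is a rough sketch in plain Java that imitates what the framework does with the two comparators. It does not use any Hadoop classes. The int fields for K1 and K2, the Entry class, the SORT and GROUP comparators, and the reduceCalls() helper are all made up for illustration; in a real job the comparators would be RawComparator implementations registered on the JobConf, and the framework does the sorting and grouping for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {
    // Stand-in for MyKey; int fields are an assumption for K1 and K2.
    static final class MyKey {
        final int primary;
        final int secondary;
        MyKey(int primary, int secondary) {
            this.primary = primary;
            this.secondary = secondary;
        }
    }

    // One map output record: composite key plus value (hypothetical helper type).
    static final class Entry {
        final MyKey key;
        final String value;
        Entry(MyKey key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Plays the role of the OutputKeyComparator: primary first, then secondary.
    static final Comparator<MyKey> SORT =
        Comparator.<MyKey>comparingInt(k -> k.primary)
                  .thenComparingInt(k -> k.secondary);

    // Plays the role of the OutputValueGroupingComparator: primary only.
    static final Comparator<MyKey> GROUP =
        Comparator.<MyKey>comparingInt(k -> k.primary);

    // Returns one list of values per simulated reduce() call.
    static List<List<String>> reduceCalls(List<Entry> mapOutput) {
        List<Entry> sorted = new ArrayList<>(mapOutput);
        sorted.sort((a, b) -> SORT.compare(a.key, b.key));

        List<List<String>> calls = new ArrayList<>();
        MyKey previous = null;
        for (Entry e : sorted) {
            // Start a new call whenever the grouping comparator sees a new key.
            if (previous == null || GROUP.compare(previous, e.key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(e.value);
            previous = e.key;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Same four records as above, deliberately out of order.
        List<Entry> mapOutput = Arrays.asList(
            new Entry(new MyKey(2, 2), "V4"),
            new Entry(new MyKey(1, 2), "V2"),
            new Entry(new MyKey(2, 1), "V3"),
            new Entry(new MyKey(1, 1), "V1"));
        System.out.println(reduceCalls(mapOutput)); // [[V1, V2], [V3, V4]]
    }
}
```

The point of the split is that the sort sees the whole composite key while the grouping sees only the primary part, so each call gets its values already ordered by the secondary key without buffering them in reduce().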

-- Owen

PS. The OutputValueGroupingComparator is a bad name. It should be OutputKeyGroupingComparator or something.
