On Aug 6, 2007, at 10:12 PM, novice user wrote:
> In reduce phase, with outputValueGroupingComparator, we can sort all keys
> and then group values of a particular key together and send it to reduce()
> method. Is there a way to sort values of a particular key efficiently
> before it reaches to reduce method?

There are two comparators that are used for sorting for precisely
this purpose. In particular:
JobConf.getOutputKeyComparator()
JobConf.getOutputValueGroupingComparator()
The first controls the sort order, and the second controls which
keys are grouped into a single call to reduce().
Therefore, if your data has primary key K1 and secondary K2:
class MyKey implements WritableComparable {
  K1 primary;
  K2 secondary;
  ...
}
you make the map output key MyKey, and the OutputKeyComparator uses
both primary and secondary to pick the order. The
OutputValueGroupingComparator would just compare the primary keys for
equality. So if your data looked like:
K1(1), K2(1), V1
K1(1), K2(2), V2
K1(2), K2(1), V3
K1(2), K2(2), V4
the records would be sorted as above, but the reduce would see only two
calls: once with K1(1) and values V1 and V2, and once with K1(2) and
values V3 and V4.
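
To make the interaction of the two comparators concrete, here is a rough
sketch in plain Java that simulates the sort-then-group step on the four
records above. It does not use Hadoop at all: the class SecondarySortSketch,
the KeyedValue type, and the reduceCalls helper are hypothetical names made
up for illustration, and the keys are assumed to be plain ints. In a real
job the comparators would be classes registered on the JobConf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    /** Hypothetical stand-in for a MyKey (primary, secondary) plus its value. */
    public static final class KeyedValue {
        final int primary;
        final int secondary;
        final String value;

        public KeyedValue(int primary, int secondary, String value) {
            this.primary = primary;
            this.secondary = secondary;
            this.value = value;
        }
    }

    // Plays the role of the OutputKeyComparator: primary first, then secondary.
    static final Comparator<KeyedValue> SORT =
        Comparator.<KeyedValue>comparingInt(r -> r.primary)
                  .thenComparingInt(r -> r.secondary);

    // Plays the role of the OutputValueGroupingComparator: primary only.
    static final Comparator<KeyedValue> GROUP =
        Comparator.comparingInt(r -> r.primary);

    /** Returns one list of values per simulated reduce() call, in call order. */
    public static List<List<String>> reduceCalls(List<KeyedValue> mapOutput) {
        List<KeyedValue> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);

        List<List<String>> calls = new ArrayList<>();
        KeyedValue previous = null;
        for (KeyedValue current : sorted) {
            // A new reduce() call starts whenever the grouping comparator
            // says the current key differs from the previous one.
            if (previous == null || GROUP.compare(previous, current) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(current.value);
            previous = current;
        }
        return calls;
    }

    public static void main(String[] args) {
        // Deliberately out of order, as map output would be before the sort.
        List<KeyedValue> mapOutput = Arrays.asList(
            new KeyedValue(2, 2, "V4"),
            new KeyedValue(1, 2, "V2"),
            new KeyedValue(2, 1, "V3"),
            new KeyedValue(1, 1, "V1"));
        System.out.println(reduceCalls(mapOutput)); // prints [[V1, V2], [V3, V4]]
    }
}
```

Within each simulated call the values arrive ordered by the secondary key,
which is the effect the original question asks for.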
-- Owen
PS. The OutputValueGroupingComparator is a bad name. It should be
OutputKeyGroupingComparator or something.