Hi all,
The short version of my question is in the subject.  Here's the long version:
I have two map/reduce jobs that output records using a common key:

Job A:
K1  =>  A1,1
K1  =>  A1,2
K2  =>  A2,1
K2  =>  A2,2

Job B:
K1  =>  B1
K2  =>  B2
K3  =>  B3

And a third job that merges records with the same key, using
IdentityMapper and a custom Reducer (sketched below):

Job C:
K1  =>  A1,1; A1,2; B1
K2  =>  A2,1; A2,2; B2
K3  =>  B3
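
In case it helps, here's roughly what my Reducer for Job C looks like
(a sketch against the old mapred API; MergeReducer is just the name I
picked, and I'm treating the values as Text for simplicity):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Concatenates every value that arrives for a key into one record.
public class MergeReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    StringBuilder merged = new StringBuilder();
    while (values.hasNext()) {
      if (merged.length() > 0) {
        merged.append("; ");   // match the "A1,1; A1,2; B1" layout above
      }
      merged.append(values.next().toString());
    }
    output.collect(key, new Text(merged.toString()));
  }
}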

The trouble is, the A's and B's are large (20-30 KB each) and I have a
few million of them.  If Job C has only one Reducer task, it takes
forever to copy and sort all the records.

So here's my question -- does Hadoop guarantee that all records with
the same key will end up in the same Reducer task?  If that's true,
then can I set the number of Reducers very high (even equal to the
number of maps) to make Job C go faster?
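
Concretely, I'm thinking of a driver along these lines (again the old
mapred API; the input format and the count of 200 are just placeholders
for whatever turns out to be appropriate):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MergeJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MergeJob.class);
    conf.setJobName("job-c-merge");

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(MergeReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Assuming Jobs A and B wrote their output as SequenceFiles.
    conf.setInputFormat(SequenceFileInputFormat.class);

    // The part I'm asking about: many reducers so the copy/sort
    // phase is spread out.  200 is just a placeholder.
    conf.setNumReduceTasks(200);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}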

Thanks for any enlightenment you can provide here,
-Stuart
