Hi all,

The short version of my question is in the subject. Here's the long version: I have two map/reduce jobs that output records using a common key:
Job A:
  K1 => A1,1
  K1 => A1,2
  K2 => A2,1
  K2 => A2,2

Job B:
  K1 => B1
  K2 => B2
  K3 => B3

And a third job that merges records with the same key, using IdentityMapper and a custom Reducer:

Job C:
  K1 => A1,1; A1,2; B1
  K2 => A2,1; A2,2; B2
  K3 => B3

The trouble is, the A's and B's are large (20-30 KB each) and I have a few million of them. If Job C has only one Reducer task, it takes forever to copy and sort all the records.

So here's my question: does Hadoop guarantee that all records with the same key will end up in the same Reducer task? If so, can I set the number of Reducers very high (even equal to the number of map tasks) to make Job C go faster?
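For concreteness, here's roughly how I'm setting up Job C (a minimal sketch against the old org.apache.hadoop.mapred API; the MergeReducer body, the paths, and the reducer count of 50 are placeholders, and the comment about HashPartitioner is my understanding, which is exactly what I'm asking you to confirm):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class JobC {

  // Placeholder for my custom reducer: concatenates all values seen for a key.
  public static class MergeReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      StringBuilder merged = new StringBuilder();
      while (values.hasNext()) {
        if (merged.length() > 0) merged.append("; ");
        merged.append(values.next().toString());
      }
      out.collect(key, new Text(merged.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JobC.class);
    conf.setJobName("merge-by-key");

    // IdentityMapper passes each (key, record) pair through unchanged;
    // the shuffle then groups all records sharing a key into one reduce() call.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(MergeReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // This is the setting in question. My understanding is that the default
    // HashPartitioner assigns each record to partition
    //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // so all records with the same key should land in the same reduce task
    // regardless of how many tasks there are -- if that's right, raising
    // this number spreads the copy/sort work without breaking the merge.
    conf.setNumReduceTasks(50);

    // Job C reads the outputs of Jobs A and B and writes the merged records.
    FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}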
Thanks for any enlightenment you can provide here,

-Stuart