> So here's my question -- does Hadoop guarantee that all records with
> the same key will end up in the same Reducer task?

Yes -- think of each record as being sent to a Reducer task chosen by
hashing over its key.

Miles

2008/9/19 Stuart Sierra <[EMAIL PROTECTED]>:
> Hi all,
> The short version of my question is in the subject. Here's the long version:
> I have two map/reduce jobs that output records using a common key:
>
> Job A:
> K1 => A1,1
> K1 => A1,2
> K2 => A2,1
> K2 => A2,2
>
> Job B:
> K1 => B1
> K2 => B2
> K3 => B3
>
> And a third job that merges records with the same key, using
> IdentityMapper and a custom Reducer:
>
> Job C:
> K1 => A1,1; A1,2; B1
> K2 => A2,1; A2,2; B2
> K3 => B3
>
> The trouble is, the A's and B's are large (20-30 KB each) and I have a
> few million of them. If Job C has only one Reducer task, it takes
> forever to copy and sort all the records.
>
> So here's my question -- does Hadoop guarantee that all records with
> the same key will end up in the same Reducer task? If that's true,
> then can I set the number of Reducers very high (even equal to the
> number of maps) to make Job C go faster?
>
> Thanks for any enlightenment you can provide here,
> -Stuart
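
For the curious, the guarantee comes from the job's partitioner. Unless
you plug in a custom one, Hadoop partitions by hashing: each key's
hashCode, taken modulo the number of Reducer tasks, picks the partition,
so identical keys always reach the same Reducer. A minimal sketch of that
logic against the old mapred API (the class name SketchHashPartitioner
is illustrative, not the shipped class):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Illustrative sketch of hash partitioning, not Hadoop's own source.
    public class SketchHashPartitioner<K, V> implements Partitioner<K, V> {

        public void configure(JobConf job) {
            // Nothing to configure for plain hash partitioning.
        }

        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so the result is non-negative, then
            // take the remainder over the number of Reducer tasks. The
            // same key always hashes to the same partition index.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

So raising the Reducer count for Job C (e.g. calling setNumReduceTasks
on the JobConf) is safe for the merge: all records for a given key still
meet in a single reduce() call, just spread across more parallel tasks.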
