Unless you can somehow guarantee that a certain output key K1 comes only from reducer R1 (which seems very unlikely, and somewhat useless in your case), I'm afraid you'll need a subsequent MR job. The thing is, Hadoop has no built-in mechanism for reducers to exchange data :)
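If it helps, you can at least wire both passes into one driver so the second job launches automatically off the first job's output. Roughly like this (a sketch against the 0.20 "mapreduce" API; the M1/R1/R2 bodies below are invented placeholders for your real classes, and only the job-to-job plumbing is the point):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class TwoPassDriver {

      // M1: emit K1-V1 pairs (placeholder: each input line is a K1, value 1).
      public static class M1 extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable off, Text line, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(line, new LongWritable(1));
        }
      }

      // R1: sees each K1 group, derives a new key K2 from it, with a count
      // (placeholder derivation: K2 is the first token of K1).
      public static class R1 extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k1, Iterable<LongWritable> vals, Context ctx)
            throws IOException, InterruptedException {
          long count = 0;
          for (LongWritable v : vals) count += v.get();
          String k2 = k1.toString().split("\\s+")[0];
          ctx.write(new Text(k2), new LongWritable(count));
        }
      }

      // R2: sums the counts for each K2.
      public static class R2 extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable c : counts) sum += c.get();
          ctx.write(k2, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path(args[1] + "-pass1"); // scratch dir between jobs

        // Job 1: M1 + R1 (group by K1, emit K2 with a count).
        Job job1 = new Job(conf, "M1-R1");
        job1.setJarByClass(TwoPassDriver.class);
        job1.setMapperClass(M1.class);
        job1.setReducerClass(R1.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(LongWritable.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Job 2: identity map + R2 (sum counts by K2).
        Job job2 = new Job(conf, "identity-R2");
        job2.setJarByClass(TwoPassDriver.class);
        job2.setMapperClass(Mapper.class); // base Mapper = identity pass-through
        job2.setReducerClass(R2.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(LongWritable.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
      }
    }

Writing the intermediate output as SequenceFiles keeps the key/value types intact between the two jobs, and the plain Mapper base class already does the identity pass-through, so M2 doesn't need to be written at all.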
Amogh

On 1/21/10 12:30 AM, "Clements, Michael" <michael.cleme...@disney.com> wrote:

The use case is this: M1-R1-R2

M1: generate K1-V1 pairs from the input
R1: group by K1; generate new keys K2 from each group, with a value V2, a count
M2: identity pass-through
R2: sum the counts by K2

In short, R1 does this:
- groups data by the K1 defined by M1
- emits new keys K2, derived from the group it built
- gives each key K2 a count

R2 sums the counts for each K2.

The output of R1 could be fed directly into R2, but I can't find a way to do that in Hadoop. So I have to create a second job, which must have a map phase, so I create a pass-through mapper. This works, but it has a lot of overhead. It would be faster and cleaner to run R1 directly into R2 within the same job, if that were possible.

From: Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

Hi,
Can you elaborate on your case a little? If you need a sort and shuffle (i.e., outputs of different reducer tasks of R1 to be aggregated in some way), you have to write another map-reduce job. If you need to process only local reducer data (i.e., your reducer's output key is the same as its input key), your job would be M1-R1-M2. Essentially, in Hadoop you can have only one sort-and-shuffle phase per job. Note that the chain APIs are for jobs of the form (M+RM*).

Amogh

On 1/20/10 2:29 AM, "Clements, Michael" <michael.cleme...@disney.com> wrote:

These two classes are not really as symmetric as their names suggest. ChainMapper does what I expected: it chains multiple map steps. But ChainReducer does not chain reducer steps; it chains map steps to follow a reduce step. At least, that is my understanding from the API docs and examples I've read.

Is there a way to chain multiple reducer steps? I've got a job that needs M1-R1-R2. It currently has two phases: M1-R1 followed by M2-R2, where M2 is an identity pass-through mapper. If there were a way to chain two reduce steps the way ChainMapper chains map steps, I could make this a one-pass job, eliminating the overhead of a second job and all the unnecessary I/O.

Thanks

Michael Clements
Solutions Architect
michael.cleme...@disney.com
206 664-4374 office
360 317 5051 mobile
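For reference, the (M+RM*) form Amogh mentions looks like this with the 0.20 "mapred" chain classes. This is a sketch only (AMap, TheReduce, and PostMap are invented placeholders), but it shows why R2 can't be chained: ChainReducer.setReducer is called exactly once, and anything added after it via ChainReducer.addMapper is a map step over each reducer's local output, with no second sort and shuffle.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainDriver {

      public static class AMap extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable off, Text line,
            OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
          out.collect(line, new LongWritable(1)); // placeholder map logic
        }
      }

      public static class TheReduce extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> vals,
            OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
          long sum = 0;
          while (vals.hasNext()) sum += vals.next().get();
          out.collect(key, new LongWritable(sum));
        }
      }

      public static class PostMap extends MapReduceBase
          implements Mapper<Text, LongWritable, Text, LongWritable> {
        public void map(Text key, LongWritable count,
            OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
          out.collect(key, count); // runs on each reducer's local output only
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChainDriver.class);
        conf.setJobName("chain (M+RM*)");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // One or more mappers run before the job's single sort and shuffle.
        ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
            Text.class, LongWritable.class, true, new JobConf(false));

        // Exactly one reducer per job; there is no second slot for an R2 here.
        ChainReducer.setReducer(conf, TheReduce.class, Text.class, LongWritable.class,
            Text.class, LongWritable.class, true, new JobConf(false));

        // Steps added after the reduce are map steps on local reducer output.
        ChainReducer.addMapper(conf, PostMap.class, Text.class, LongWritable.class,
            Text.class, LongWritable.class, true, new JobConf(false));

        JobClient.runJob(conf);
      }
    }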