That is what I figured. I saw the "ChainReducer" class and got all excited about chaining multiple reducers, but that is not what this class does.
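For anyone else who trips over the naming, here is a rough sketch of the shape of job I now understand the chain classes to build (old mapred API; M1, R1 and PostReduceMap are placeholder classes of my own, not anything shipped with Hadoop): one reduce per job, with extra map steps hung before and after it.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainJobDriver {
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(ChainJobDriver.class);
    job.setJobName("M+ R M* chain");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapOutputKeyClass(Text.class);         // types going into the shuffle
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);            // types of the final output
    job.setOutputValueClass(LongWritable.class);

    JobConf empty = new JobConf(false);           // per-element config; nothing extra here

    // One or more map steps run before the single shuffle/sort.
    ChainMapper.addMapper(job, M1.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, empty);

    // Exactly one reduce step per job.
    ChainReducer.setReducer(job, R1.class,
        Text.class, Text.class, Text.class, LongWritable.class,
        true, empty);

    // Anything added after the reduce is another *map*, run on R1's own
    // output inside the same reduce task. There is no second shuffle/sort,
    // so it cannot regroup values the way a real second reducer would.
    ChainReducer.addMapper(job, PostReduceMap.class,
        Text.class, LongWritable.class, Text.class, LongWritable.class,
        true, empty);

    JobClient.runJob(job);
  }
}

So everything after the single reduce step is just post-processing of each reduce task's local output; it never sees a regrouped key space.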
I see your point about it being in general impossible to chain reducers. Chained mappers are OK, since each just emits K-V pairs and none of them needs a full global set of keys or values. But chaining reducers is much more difficult, as each reducer needs the full set of values for each key, and no single upstream reducer instance can know that.

Thanks

From: mapreduce-user-return-315-michael.clements=disney....@hadoop.apache.org [mailto:mapreduce-user-return-315-michael.clements=disney....@hadoop.apache.org] On Behalf Of Amogh Vasekar
Sent: Thursday, January 21, 2010 4:17 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

Unless you can somehow guarantee that a certain output key K1 comes only from reducer R1 (which seems very unlikely, and somewhat useless in your case), I'm afraid you'll need a subsequent MR job. The thing is, Hadoop has no "in-built" mechanism for reducers to exchange data :)

Amogh

On 1/21/10 12:30 AM, "Clements, Michael" <michael.cleme...@disney.com> wrote:

The use case is this: M1-R1-R2

M1: generate K1-V1 pairs from the input
R1: group by K1, generate new keys K2 from each group, with value V2, a count
M2: identity pass-through
R2: sum counts by K2

In short, R1 does this:
- groups data by the K1 defined by M1
- emits new keys K2, derived from the group it built
- each key K2 has a count

R2 then sums the counts for each K2.

The output of R1 could be fed directly into R2, but I can't find a way to do that in Hadoop. So I have to create a second job, which must have a Map phase, so I create a pass-through mapper. This works, but it has a lot of overhead. It would be faster and cleaner to run R1 directly into R2 within the same job, if that were possible.

From: mapreduce-user-return-302-michael.clements=disney....@hadoop.apache.org [mailto:mapreduce-user-return-302-michael.clements=disney....@hadoop.apache.org] On Behalf Of Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

Hi,
Can you elaborate on your case a little? If you need sort and shuffle (i.e. outputs of different reducer tasks of R1 to be aggregated in some way), you have to write another map-red job. If you only need to process each reducer's local data (i.e. your reducer output key is the same as its input key), your job would be M1-R1-M2. Essentially, in Hadoop you can have only one sort-and-shuffle phase per job. Note that the chain APIs are for jobs of the form (M+RM*).

Amogh

On 1/20/10 2:29 AM, "Clements, Michael" <michael.cleme...@disney.com> wrote:

These two classes are not really as symmetric as their names suggest. ChainMapper does what I expected: it chains multiple map steps. But ChainReducer does not chain reduce steps; it chains map steps to follow a reduce step. At least, that is my understanding from the API docs and examples I've read.

Is there a way to chain multiple reduce steps? I've got a job that needs M1-R1-R2. It currently has two phases, M1-R1 followed by M2-R2, where M2 is an identity pass-through mapper. If there were a way to chain two reduce steps the way ChainMapper chains map steps, I could make this a one-pass job, eliminating the overhead of a second job and all the unnecessary I/O.

Thanks

Michael Clements
Solutions Architect
michael.cleme...@disney.com
206 664-4374 office
360 317 5051 mobile
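For reference, a minimal sketch of the two-pass workaround described above (old mapred API; M1, R1 and R2 are placeholders for the real classes, and the key/value types are only illustrative), using IdentityMapper for the second job's required map phase and a SequenceFile handoff so the K2/count pairs carry between jobs:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class TwoPassDriver {
  public static void main(String[] args) throws IOException {
    Path in = new Path(args[0]), tmp = new Path(args[1]), out = new Path(args[2]);

    // Job 1: M1 -> R1 (group by K1, emit K2 with a count)
    JobConf job1 = new JobConf(TwoPassDriver.class);
    job1.setJobName("M1-R1");
    job1.setMapperClass(M1.class);                 // M1, R1, R2 are placeholders
    job1.setReducerClass(R1.class);
    job1.setMapOutputKeyClass(Text.class);         // K1-V1 types, adjust as needed
    job1.setMapOutputValueClass(Text.class);
    job1.setOutputKeyClass(Text.class);            // K2
    job1.setOutputValueClass(LongWritable.class);  // count
    job1.setOutputFormat(SequenceFileOutputFormat.class); // keep K2/count as binary pairs
    FileInputFormat.setInputPaths(job1, in);
    FileOutputFormat.setOutputPath(job1, tmp);
    JobClient.runJob(job1);

    // Job 2: identity map -> R2 (sum counts by K2), reading job 1's output
    JobConf job2 = new JobConf(TwoPassDriver.class);
    job2.setJobName("identity-R2");
    job2.setInputFormat(SequenceFileInputFormat.class); // read job 1's K2/count pairs
    job2.setMapperClass(IdentityMapper.class);          // the pass-through map phase
    job2.setReducerClass(R2.class);
    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(job2, tmp);
    FileOutputFormat.setOutputPath(job2, out);
    JobClient.runJob(job2);
  }
}

The SequenceFile handoff is what lets the identity mapper pass R1's keys and counts straight through to R2 without reparsing text; the cost of the extra job is the second round of task startup, shuffle, and intermediate I/O that the thread above is trying to avoid.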