Unless you can somehow guarantee that a certain output key K1 comes only from reducer R1 (which seems very unlikely, and somewhat useless in your case), I'm afraid you'll need a subsequent MR job. The thing is, Hadoop has no built-in mechanism for reducers to exchange data :)
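If it helps, you can at least wire both passes into one driver so the second job launches automatically off the first job's output. Roughly like this (a sketch against the 0.20 "mapreduce" API; the M1/R1/R2 bodies below are invented placeholders for your real classes, and only the job-to-job plumbing is the point):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class TwoPassDriver {

      // M1: emit K1-V1 pairs (placeholder: each input line is a K1, value 1).
      public static class M1 extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable off, Text line, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(line, new LongWritable(1));
        }
      }

      // R1: sees each K1 group, derives a new key K2 from it, with a count
      // (placeholder derivation: K2 is the first token of K1).
      public static class R1 extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k1, Iterable<LongWritable> vals, Context ctx)
            throws IOException, InterruptedException {
          long count = 0;
          for (LongWritable v : vals) count += v.get();
          String k2 = k1.toString().split("\\s+")[0];
          ctx.write(new Text(k2), new LongWritable(count));
        }
      }

      // R2: sums the counts for each K2.
      public static class R2 extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable c : counts) sum += c.get();
          ctx.write(k2, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path(args[1] + "-pass1"); // scratch dir between jobs

        // Job 1: M1 + R1 (group by K1, emit K2 with a count).
        Job job1 = new Job(conf, "M1-R1");
        job1.setJarByClass(TwoPassDriver.class);
        job1.setMapperClass(M1.class);
        job1.setReducerClass(R1.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(LongWritable.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Job 2: identity map + R2 (sum counts by K2).
        Job job2 = new Job(conf, "identity-R2");
        job2.setJarByClass(TwoPassDriver.class);
        job2.setMapperClass(Mapper.class); // base Mapper = identity pass-through
        job2.setReducerClass(R2.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(LongWritable.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
      }
    }

Writing the intermediate output as SequenceFiles keeps the key/value types intact between the two jobs, and the plain Mapper base class already does the identity pass-through, so M2 doesn't need to be written at all.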
Amogh

On 1/21/10 12:30 AM, "Clements, Michael" <michael.cleme...@disney.com> wrote:

The use case is this: M1-R1-R2

M1: generate K1-V1 pairs from the input
R1: group by K1; generate new keys K2 from each group, with a value V2, a count
M2: identity pass-through
R2: sum the counts by K2

In short, R1 does this:
- groups data by the K1 defined by M1
- emits new keys K2, derived from the group it built
- gives each key K2 a count

R2 sums the counts for each K2.

The output of R1 could be fed directly into R2, but I can't find a way to do that in Hadoop. So I have to create a second job, which must have a map phase, so I create a pass-through mapper. This works, but it has a lot of overhead. It would be faster and cleaner to run R1 directly into R2 within the same job, if that were possible.

From: Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

Hi,
Can you elaborate on your case a little? If you need a sort and shuffle (i.e., outputs of different reducer tasks of R1 to be aggregated in some way), you have to write another map-reduce job. If you need to process only local reducer data (i.e., your reducer's output key is the same as its input key), your job would be M1-R1-M2. Essentially, in Hadoop you can have only one sort-and-shuffle phase per job. Note that the chain APIs are for jobs of the form (M+RM*).

Amogh

On 1/20/10 2:29 AM, "Clements, Michael" <michael.cleme...@disney.com> wrote:

These two classes are not really as symmetric as their names suggest. ChainMapper does what I expected: it chains multiple map steps. But ChainReducer does not chain reducer steps; it chains map steps to follow a reduce step. At least, that is my understanding from the API docs and examples I've read.

Is there a way to chain multiple reducer steps? I've got a job that needs M1-R1-R2. It currently has two phases: M1-R1 followed by M2-R2, where M2 is an identity pass-through mapper. If there were a way to chain two reduce steps the way ChainMapper chains map steps, I could make this a one-pass job, eliminating the overhead of a second job and all the unnecessary I/O.

Thanks

Michael Clements
Solutions Architect
michael.cleme...@disney.com
206 664-4374 office
360 317 5051 mobile
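For reference, the (M+RM*) form Amogh mentions looks like this with the 0.20 "mapred" chain classes. This is a sketch only (AMap, TheReduce, and PostMap are invented placeholders), but it shows why R2 can't be chained: ChainReducer.setReducer is called exactly once, and anything added after it via ChainReducer.addMapper is a map step over each reducer's local output, with no second sort and shuffle.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.ChainMapper;
    import org.apache.hadoop.mapred.lib.ChainReducer;

    public class ChainDriver {

      public static class AMap extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable off, Text line,
            OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
          out.collect(line, new LongWritable(1)); // placeholder map logic
        }
      }

      public static class TheReduce extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> vals,
            OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
          long sum = 0;
          while (vals.hasNext()) sum += vals.next().get();
          out.collect(key, new LongWritable(sum));
        }
      }

      public static class PostMap extends MapReduceBase
          implements Mapper<Text, LongWritable, Text, LongWritable> {
        public void map(Text key, LongWritable count,
            OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
          out.collect(key, count); // runs on each reducer's local output only
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChainDriver.class);
        conf.setJobName("chain (M+RM*)");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // One or more mappers run before the job's single sort and shuffle.
        ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
            Text.class, LongWritable.class, true, new JobConf(false));

        // Exactly one reducer per job; there is no second slot for an R2 here.
        ChainReducer.setReducer(conf, TheReduce.class, Text.class, LongWritable.class,
            Text.class, LongWritable.class, true, new JobConf(false));

        // Steps added after the reduce are map steps on local reducer output.
        ChainReducer.addMapper(conf, PostMap.class, Text.class, LongWritable.class,
            Text.class, LongWritable.class, true, new JobConf(false));

        JobClient.runJob(conf);
      }
    }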