The use case is this: M1-R1-R2

M1: generate K1-V1 pairs from the input
R1: group by K1; generate new keys K2 from each group, each with a value V2 (a count)
M2: identity pass-through
R2: sum the counts by K2

In short, R1 does this:
- groups data by the K1 defined by M1
- emits new keys K2, derived from the group it built
- attaches a count to each key K2

R2 then sums the counts for each K2.

The output of R1 could be fed directly into R2. But I can't find a way
to do that in Hadoop. So I have to create a second job, which has to
have a Map phase, so I create a pass-through mapper. This works but it
has a lot of overhead. It would be faster & cleaner to run R1 directly
into R2 within the same job - if possible.
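
To make the data flow concrete, here is a rough sketch in plain Python (not Hadoop code) that simulates the two jobs in memory. The record format, the way R1 derives K2, and all function names are made up for illustration; only the M1-R1 / M2-R2 shape comes from the description above.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, standing in for sort-and-shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def m1(record):
    # Hypothetical M1: split "k1:v1" into a (K1, V1) pair.
    k1, v1 = record.split(":", 1)
    yield k1, v1


def r1(k1, values):
    # Hypothetical R1: derive K2 from the whole group. Here K2 is the number
    # of distinct values in the group, emitted with a count of 1.
    yield len(set(values)), 1


def m2(k2, count):
    # Identity pass-through mapper; it exists only so job 2 has a map phase.
    yield k2, count


def r2(k2, counts):
    # Sum the counts for each K2.
    yield k2, sum(counts)


def run(records):
    # Job 1: M1, shuffle on K1, R1.
    mapped1 = [pair for record in records for pair in m1(record)]
    job1_out = [pair for k1, vs in shuffle(mapped1) for pair in r1(k1, vs)]
    # Job 2: M2 (identity), shuffle on K2, R2.
    mapped2 = [pair for k2, c in job1_out for pair in m2(k2, c)]
    return dict(pair for k2, cs in shuffle(mapped2) for pair in r2(k2, cs))
```

For example, `run(["a:x", "a:y", "b:x", "c:z", "c:z", "c:w"])` builds three K1 groups, two of which have two distinct values and one of which has one. The second `shuffle` call is the step that regroups R1's output by K2 before R2 sees it.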

 

 

From: mapreduce-user-return-302-michael.clements=disney....@hadoop.apache.org
[mailto:mapreduce-user-return-302-michael.clements=disney....@hadoop.apache.org]
On Behalf Of Amogh Vasekar
Sent: Tuesday, January 19, 2010 10:53 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: chained mappers & reducers

 

Hi,
Can you elaborate on your case a little?
If you need sort and shuffle (i.e., the outputs of different reducer tasks
of R1 have to be aggregated in some way), you have to write another
MapReduce job. If you only need to process local reducer data (i.e., your
reducer output key is the same as its input key), your job would be
M1-R1-M2. Essentially, in Hadoop you can have only one sort and shuffle
phase per job.
Note that the chain APIs are for jobs of the form (M+RM*).
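
As an illustration of the (M+RM*) shape, the following Python snippet treats a job as a string of stage letters and checks it against that pattern. This is only a way to read the notation; the function name is hypothetical and nothing here is part of the Hadoop API.

```python
import re

# One or more map stages, exactly one reduce stage, then zero or more maps.
CHAIN_SHAPE = re.compile(r"M+RM*")


def fits_single_chained_job(stages):
    """Return True if a stage string such as "MRM" has the form M+RM*."""
    return CHAIN_SHAPE.fullmatch(stages) is not None
```

Under this reading, "MR" and "MMRM" fit in one chained job, while "MRR" does not, because the pattern allows only a single R.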

Amogh


On 1/20/10 2:29 AM, "Clements, Michael" <michael.cleme...@disney.com>
wrote:

These two classes are not really symmetric, as their names suggest.
ChainMapper does what I expected: it chains multiple map steps. But
ChainReducer does not chain reducer steps. It chains map steps to
follow a reduce step. At least, that is my understanding given the API
docs & examples I've read.

Is there a way to chain multiple reducer steps? I've got a job that
needs an M-R1-R2. It currently has 2 phases: M1-R1 followed by M2-R2,
where M2 is an identity pass-through mapper. If there were a way to
chain 2 reduce steps the way ChainMapper chains map steps, I could
make this into a one-pass job, eliminating the overhead of a second job
and all the unnecessary I/O.

Thanks

Michael Clements
Solutions Architect
michael.cleme...@disney.com
206 664-4374 office
360 317 5051 mobile



