On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:

> Thanks a lot, guys, for the answers.
>
> I am still not able to find the exact code for the following things:
>
> 1. Where a reducer reads only its own partition from a map output. I looked
> into ReduceTask#getMapOutput, which does the actual read in
> ReduceTask#shuffleInMemory, but I don't see where it specifies which
> partition (the reduce ID) to read.
>
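For context on question 1: in Hadoop 1.x the reduce-side fetcher names its partition in the HTTP request it sends to each map-side TaskTracker, whose MapOutputServlet reads the `job`, `map`, and `reduce` query parameters and uses the per-map index file to seek to that partition's byte range. A minimal sketch (the class and helper names here are illustrative, not the actual Hadoop source):

```java
// Sketch of how a reduce-side fetch identifies its partition in Hadoop 1.x.
// The URL shape matches the query parameters read by
// TaskTracker.MapOutputServlet; the helper itself is hypothetical.
public class ShuffleUrlSketch {
    static String mapOutputUrl(String ttHost, int httpPort,
                               String jobId, String mapId, int reducePartition) {
        // The reduce partition id travels as the "reduce" query parameter,
        // so the serving TaskTracker can return only that partition's bytes.
        return "http://" + ttHost + ":" + httpPort + "/mapOutput"
             + "?job=" + jobId
             + "&map=" + mapId
             + "&reduce=" + reducePartition;
    }

    public static void main(String[] args) {
        // 50060 is the default TaskTracker HTTP port in Hadoop 1.x.
        System.out.println(mapOutputUrl("tt-node-07", 50060,
                "job_201207090001_0042",
                "attempt_201207090001_0042_m_000003_0", 5));
    }
}
```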
Look at TaskTracker.MapOutputServlet.

> 2. I still don't understand very well in which part of the code
> (MapTask.java) the intermediate data is written to each partition. I see
> that MapOutputBuffer is the one that actually writes the data to the buffer
> and spills it once the buffer is full, but could you please elaborate a bit
> on how the data is assigned to a partition?

Essentially, you can think of the partition-id as the 'primary key' and the
actual 'key' in the map output of <key, value> as the 'secondary key'.

hth,
Arun

> Thanks,
> Robert
>
> From: Arun C Murthy <a...@hortonworks.com>
> To: mapreduce-user@hadoop.apache.org
> Sent: Monday, July 9, 2012 9:24 AM
> Subject: Re: Basic question on how reducer works
>
> Robert,
>
> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>
>> Hi,
>>
>> I have some questions related to basic functionality in Hadoop.
>>
>> 1. When a Mapper produces its intermediate output data, how does it know
>> how many partitions to create (i.e., how many reducers there will be) and
>> how much data should go into each partition for each reducer?
>>
>> 2. When the JobTracker assigns a task to a reducer, does it also specify
>> the locations of the intermediate output data it should retrieve? And how
>> does a reducer know which portion of the intermediate output it has to
>> retrieve from each remote location?
>
> To add to Harsh's comment: essentially, the TT *knows* where the output of
> a given map-id/reduce-id pair is present via an output-file/index-file
> combination.
>
> Arun
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
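The 'primary key / secondary key' point above can be sketched in a few lines. This is a minimal illustration, not the actual MapOutputBuffer code: it assumes the default HashPartitioner formula, and shows that sorting records by (partition, key) makes each reducer's partition a contiguous run in the spill file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: partition assignment plus the (partition, key) sort order that
// MapOutputBuffer's spills follow. Class and variable names are hypothetical.
public class PartitionSortSketch {
    // Same formula as Hadoop's default HashPartitioner:
    // mask off the sign bit, then mod by the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        final int numReducers = 3;
        List<String> keys = new ArrayList<>(
                Arrays.asList("apple", "banana", "cherry", "date"));

        // Sort by partition-id first ("primary key"), then by the record
        // key ("secondary key"): each reducer's data ends up contiguous.
        keys.sort(Comparator
                .comparingInt((String k) -> getPartition(k, numReducers))
                .thenComparing(Comparator.naturalOrder()));

        for (String k : keys) {
            System.out.println(getPartition(k, numReducers) + "\t" + k);
        }
    }
}
```

Because the spill is ordered this way, the index file only needs one (offset, length) entry per partition, which is exactly what MapOutputServlet uses to serve a single reducer's byte range.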