I see. I was looking into the TaskTracker log :).

Thanks a lot,
Robert
________________________________
From: Harsh J <ha...@cloudera.com>
To: Grandl Robert <rgra...@yahoo.com>; mapreduce-user <mapreduce-user@hadoop.apache.org>
Sent: Sunday, July 8, 2012 9:16 PM
Subject: Re: Basic question on how reducer works

The changes should appear in your Task's userlogs (not the TaskTracker
logs). Have you deployed your changed code properly (i.e. do you
generate a new tarball, or perhaps use the MiniMRCluster to do this)?

On Mon, Jul 9, 2012 at 4:57 AM, Grandl Robert <rgra...@yahoo.com> wrote:
> Hi Harsh,
>
> Your comments were extremely helpful.
>
> Still, I am wondering why, if I add LOG.info entries into MapTask.java
> or ReduceTask.java in most of the functions (including
> Old/NewOutputCollector), the logs are not shown. This makes it hard
> for me to track which functions are called and which are not, even
> more so in ReduceTask.java.
>
> Do you have any ideas?
>
> Thanks a lot for your answer,
> Robert
>
> ________________________________
> From: Harsh J <ha...@cloudera.com>
> To: mapreduce-user@hadoop.apache.org; Grandl Robert <rgra...@yahoo.com>
> Sent: Sunday, July 8, 2012 1:34 AM
> Subject: Re: Basic question on how reducer works
>
> Hi Robert,
>
> Inline. (The answer is specific to Hadoop 1.x since you asked for
> that alone, but certain things may vary for Hadoop 2.x.)
>
> On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert <rgra...@yahoo.com> wrote:
>> Hi,
>>
>> I have some questions related to basic functionality in Hadoop.
>>
>> 1. When a Mapper produces its intermediate output data, how does it
>> know how many partitions to create (i.e. how many reducers there
>> will be) and how much data should go into each partition for each
>> reducer?
>
> The number of reducers is non-dynamic and user-specified; it is set
> in the job configuration. Hence the Partitioner knows the value it
> needs to use for its numPartitions (== numReduces for the job).
>
> For this one in the 1.x code, look at MapTask.java, in the
> constructors of the internal classes OldOutputCollector (Stable API)
> and NewOutputCollector (New API).
>
> The data estimated to be going into a partition, for limit/scheduling
> checks, is currently a naive computation, done by summing the
> estimated output sizes of each map. See
> ResourceEstimator#getEstimatedReduceInputSize for the overall
> estimation across maps, and Task#calculateOutputSize for the per-map
> estimation code.
>
>> 2. When the JobTracker assigns a task to a reducer, does it also
>> specify the locations of the intermediate output data the reducer
>> should retrieve? And how does a reducer know which portion of the
>> intermediate output it has to retrieve from each remote location?
>
> The JT does not send the location information when a reduce is
> scheduled. When the reducers begin their shuffle phase, they query
> the TaskTracker to get the map completion events, via the
> TaskTracker#getMapCompletionEvents protocol call. The TaskTracker by
> itself calls the JobTracker#getTaskCompletionEvents protocol call to
> get this info underneath. The returned structure carries the host
> that has completed the map successfully, which the Reduce's copier
> relies on to fetch the data from the right host's TT.
>
> The reduce merely asks each TT for the data assigned to it from the
> specific completed maps. Note that a reduce task's ID is also its
> partition ID, so it merely has to ask for the data for its own task
> ID, and the TT serves, over HTTP, the right parts of the intermediate
> data to it.
>
> Feel free to ping back if you need some more clarification! :)
>
> --
> Harsh J

--
Harsh J
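To make the userlogs point above concrete: the 1.x task code logs
through Apache commons-logging, and anything logged from inside a task
lands in that task attempt's userlogs directory rather than in the
TaskTracker daemon's log. Below is a minimal sketch of the idiom, not
the actual MapTask/ReduceTask code; the class name and the userlogs
path layout mentioned in the comments are illustrative assumptions.

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    // Minimal sketch of the commons-logging idiom used throughout the
    // 1.x task code. Statements like the one below, added inside
    // MapTask or ReduceTask, appear in the task attempt's userlogs
    // (for example, under ${hadoop.log.dir}/userlogs/<attempt-id>/syslog;
    // the exact layout is an assumption to verify), not in the
    // TaskTracker daemon log, and only if the rebuilt jar is the one
    // the cluster actually runs.
    public class TaskLoggingSketch {
      private static final Log LOG = LogFactory.getLog(TaskLoggingSketch.class);

      public void run() {
        LOG.info("entering run(): illustrative tracing statement");
      }
    }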
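On question 1, the default behavior is easy to picture. The sketch
below mirrors the hash-modulo idea of Hadoop's default HashPartitioner
(the class here is a stand-in, not the shipped implementation): the
job's configured reduce count arrives as numPartitions, and each key
is mapped to one of those partitions.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch mirroring the hash-modulo idea behind Hadoop's default
    // HashPartitioner: the partition index is derived from the key's
    // hash, bounded by the number of reduce tasks set for the job.
    public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

The numPartitions value here is exactly the job's reduce count, i.e.
what job.setNumReduceTasks(n) (or mapred.reduce.tasks) configures,
which is why the mapper never needs to discover it dynamically.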
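The "naive computation" for estimated reduce input can likewise be
sketched in a few lines. All names below are illustrative stand-ins;
the real logic lives in ResourceEstimator#getEstimatedReduceInputSize
and Task#calculateOutputSize.

    // Illustrative stand-in for the naive estimate described above:
    // sum the per-map output size estimates to approximate the total
    // reduce input, then optionally split evenly per reduce. Real data
    // skew can make an individual partition far larger than this even
    // share.
    public class ReduceInputEstimateSketch {
      // Overall estimation across maps.
      public static long estimatedReduceInputSize(long[] perMapOutputEstimates) {
        long total = 0;
        for (long estimate : perMapOutputEstimates) {
          total += estimate;
        }
        return total;
      }

      // Naive even split of the total across the job's reduces.
      public static long perPartitionShare(long totalEstimate, int numReduces) {
        return numReduces == 0 ? 0 : totalEstimate / numReduces;
      }
    }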
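Finally, on question 2: the "reduce task ID is also its partition ID"
point shows up directly in how a 1.x copier addresses a TaskTracker's
map-output HTTP servlet. The sketch below builds such a fetch URL; the
job/map/reduce parameter names follow the 1.x mapOutput servlet as
commonly described, so treat the exact format as an assumption to
check against your source tree.

    import java.net.MalformedURLException;
    import java.net.URL;

    // Sketch: how a reduce-side copier could address the TaskTracker
    // HTTP servlet that serves map outputs in Hadoop 1.x. The query
    // parameters (job, map, reduce) follow the 1.x mapOutput servlet;
    // verify the exact format before relying on it.
    public class ShuffleUrlSketch {
      public static URL mapOutputUrl(String ttHost, int httpPort, String jobId,
                                     String mapAttemptId, int reducePartition)
          throws MalformedURLException {
        // The reduce asks only for its own partition: its task ID
        // doubles as the partition number within every completed
        // map's intermediate output.
        return new URL(String.format(
            "http://%s:%d/mapOutput?job=%s&map=%s&reduce=%d",
            ttHost, httpPort, jobId, mapAttemptId, reducePartition));
      }
    }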