I see. I was looking into the TaskTracker log :).

Thanks a lot,
Robert



________________________________
 From: Harsh J <ha...@cloudera.com>
To: Grandl Robert <rgra...@yahoo.com>; mapreduce-user 
<mapreduce-user@hadoop.apache.org> 
Sent: Sunday, July 8, 2012 9:16 PM
Subject: Re: Basic question on how reducer works
 
The changes should appear in your Task's userlogs (not the TaskTracker
logs). Have you deployed your changed code properly (i.e., did you
generate a new tarball, or perhaps use the MiniMRCluster to run it)?
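
For reference, this is the logging pattern the task classes use (the
class below is my own minimal example, not shipped code); with a stock
1.x setup the output lands under the task attempt's userlogs directory
(e.g. ${hadoop.log.dir}/userlogs/), not in the TaskTracker daemon's log:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class ShuffleTrace {
      // MapTask and ReduceTask obtain their loggers the same way
      private static final Log LOG = LogFactory.getLog(ShuffleTrace.class);

      public void trace(String where) {
        // appears in the attempt's syslog file, not the TT daemon log
        LOG.info("reached " + where);
      }
    }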

On Mon, Jul 9, 2012 at 4:57 AM, Grandl Robert <rgra...@yahoo.com> wrote:
> Hi Harsh,
>
> Your comments were extremely helpful.
>
> Still, I am wondering why, if I add LOG.info entries to MapTask.java
> or ReduceTask.java in most of the functions (including
> Old/NewOutputCollector), the logs are not shown. This makes it hard
> for me to track which functions are called and which are not,
> especially in ReduceTask.java.
>
> Do you have any ideas ?
>
> Thanks a lot for your answer,
> Robert
>
> ________________________________
> From: Harsh J <ha...@cloudera.com>
> To: mapreduce-user@hadoop.apache.org; Grandl Robert <rgra...@yahoo.com>
> Sent: Sunday, July 8, 2012 1:34 AM
>
> Subject: Re: Basic question on how reducer works
>
> Hi Robert,
>
> Inline. (Answer is specific to Hadoop 1.x since you asked for that
> alone, but certain things may vary for Hadoop 2.x).
>
> On Sun, Jul 8, 2012 at 7:07 AM, Grandl Robert <rgra...@yahoo.com> wrote:
>> Hi,
>>
>> I have some questions related to basic functionality in Hadoop.
>>
>> 1. When a Mapper processes the intermediate output data, how does it
>> know how many partitions to create (i.e., how many reducers there
>> will be), and how much data should go into each partition for each
>> reducer?
>
> The number of reducers is not dynamic: it is user-specified, and is
> set in the job configuration. Hence the Partitioner knows the value it
> needs to use for its numPartitions (== numReduces for the job).
>
> For this one in 1.x code, look at MapTask.java, in the constructors of
> internal classes OldOutputCollector (Stable API) and
> NewOutputCollector (New API).
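> 
> To make that wiring concrete, here is roughly what the default
> partitioner does with numPartitions (this mirrors HashPartitioner; the
> class itself is my own minimal sketch, not the shipped code):
> 
>     import org.apache.hadoop.mapreduce.Partitioner;
> 
>     public class SketchPartitioner<K, V> extends Partitioner<K, V> {
>       @Override
>       public int getPartition(K key, V value, int numPartitions) {
>         // numPartitions == number of reduces set in the job config;
>         // mask the sign bit so the modulo result is non-negative
>         return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
>       }
>     }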
>
> The data estimated to be going into a partition, for limit/scheduling
> checks, is currently a naive computation, done by summing the
> estimated output sizes of the maps. See
> ResourceEstimator#getEstimatedReduceInputSize for the overall
> estimation across maps, and Task#calculateOutputSize for the per-map
> estimation code.
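> 
> As a very rough sketch of that idea (the names below are mine, and the
> real ResourceEstimator is more careful about partial data): scale the
> measured output of the completed maps up to all maps, then split
> evenly across the reduces.
> 
>     class ReduceInputEstimate {
>       // completedMapsOutputBytes: summed, measured output of finished maps
>       static long estimate(long completedMapsOutputBytes, int completedMaps,
>                            int totalMaps, int numReduces) {
>         if (completedMaps == 0 || numReduces == 0) {
>           return 0;  // nothing measured yet, or a map-only job
>         }
>         long estimatedTotalMapOutput =
>             (completedMapsOutputBytes / completedMaps) * totalMaps;
>         return estimatedTotalMapOutput / numReduces;
>       }
>     }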
>
>> 2. When the JobTracker assigns a task to a reducer, does it also
>> specify the locations of the intermediate output data to retrieve?
>> And how does a reducer know, for each remote location holding
>> intermediate output, which portion it alone has to retrieve?
>
> The JT does not send the location information when a reduce is
> scheduled. When the reducers begin their shuffle phase, they query the
> TaskTracker for the map completion events, via the
> TaskTracker#getMapCompletionEvents protocol call. The TaskTracker in
> turn calls the JobTracker#getTaskCompletionEvents protocol call to get
> this info underneath. The returned structure carries the host that
> completed the map successfully, which the Reduce's copier relies on to
> fetch the data from the right host's TT.
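> 
> For illustration, a self-contained pseudo-Java sketch of that flow
> (these interfaces stand in for the real umbilical protocol; all names
> below are mine, not the 1.x classes):
> 
>     import java.util.Queue;
> 
>     interface CompletionEvent {
>       boolean succeeded();       // did the map attempt succeed?
>       String mapAttemptId();     // which map produced the output
>       String taskTrackerHttp();  // host:port of the TT holding it
>     }
> 
>     interface Umbilical {
>       // answered by the local TT, which queries the JT underneath
>       CompletionEvent[] getMapCompletionEvents(String jobId, int fromEventId);
>     }
> 
>     class EventPoller {
>       static void poll(Umbilical umbilical, String jobId,
>                        Queue<String> fetchQueue, int totalMaps) {
>         int fromEventId = 0;
>         int completed = 0;
>         while (completed < totalMaps) {  // the real loop also sleeps/retries
>           CompletionEvent[] events =
>               umbilical.getMapCompletionEvents(jobId, fromEventId);
>           fromEventId += events.length;
>           for (CompletionEvent e : events) {
>             if (e.succeeded()) {
>               completed++;
>               // remember which host to fetch this map's output from
>               fetchQueue.add(e.taskTrackerHttp() + "/" + e.mapAttemptId());
>             }
>           }
>         }
>       }
>     }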
>
> The reduce merely asks each TT for the data assigned to it from the
> specific completed maps. Note that a reduce task's ID is also its
> partition ID, so it simply asks for the data of its own task ID, and
> the TT serves, over HTTP, the right parts of the intermediate data
> to it.
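> 
> The fetch itself is a plain HTTP GET against the serving TT. As a
> sketch of the URL the copier builds (the helper below is mine; the
> servlet path and query parameters are what I recall from the 1.x
> TaskTracker's MapOutputServlet):
> 
>     import java.net.MalformedURLException;
>     import java.net.URL;
> 
>     class MapOutputUrl {
>       // reducePartition == this reduce task's ID within the job
>       static URL build(String ttHttpAddress, String jobId,
>                        String mapAttemptId, int reducePartition)
>           throws MalformedURLException {
>         return new URL("http://" + ttHttpAddress + "/mapOutput"
>             + "?job=" + jobId
>             + "&map=" + mapAttemptId
>             + "&reduce=" + reducePartition);
>       }
>     }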
>
> Feel free to ping back if you need some more clarification! :)
>
> --
> Harsh J
>
>



-- 
Harsh J
