Answers inline.

On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia <mohitanch...@gmail.com>wrote:

> I am going through the chapter "How mapreduce works" and have some
> confusion:
>
> 1) Below description of Mapper says that reducers get the output file using
> HTTP call. But the description under "The Reduce Side" doesn't specifically
> say if it's copied using HTTP. So first confusion, Is the output copied
> from mapper -> reducer or from reducer -> mapper? And second, Is the call
> http:// or hdfs://
>

Map output is written to local FS, not HDFS.

>
> 2) My understanding was that mapper output gets written to hdfs, since I've
> seen part-m-00000 files in hdfs. If mapper output is written to HDFS then
> shouldn't reducers simply read it from hdfs instead of making http calls to
> tasktrackers location?
>
> Map output is sent to HDFS when reducer is not used.


>
> ----- from the book ---
> Mapper
> The output file’s partitions are made available to the reducers over HTTP.
> The number of worker threads used to serve the file partitions is
> controlled by the tasktracker.http.threads property
> this setting is per tasktracker, not per map task slot. The default of 40
> may need increasing for large clusters running large jobs.6.4.2.
>
> The Reduce Side
> Let’s turn now to the reduce part of the process. The map output file is
> sitting on the local disk of the tasktracker that ran the map task
> (note that although map outputs always get written to the local disk of the
> map tasktracker, reduce outputs may not be), but now it is needed by the
> tasktracker
> that is about to run the reduce task for the partition. Furthermore, the
> reduce task needs the map output for its particular partition from several
> map tasks across the cluster.
> The map tasks may finish at different times, so the reduce task starts
> copying their outputs as soon as each completes. This is known as the copy
> phase of the reduce task.
> The reduce task has a small number of copier threads so that it can fetch
> map outputs in parallel.
> The default is five threads, but this number can be changed by setting the
> mapred.reduce.parallel.copies property.
>

Reply via email to