Answers inline. On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia <mohitanch...@gmail.com>wrote:
> I am going through the chapter "How mapreduce works" and have some > confusion: > > 1) Below description of Mapper says that reducers get the output file using > HTTP call. But the description under "The Reduce Side" doesn't specifically > say if it's copied using HTTP. So first confusion, Is the output copied > from mapper -> reducer or from reducer -> mapper? And second, Is the call > http:// or hdfs:// > Map output is written to local FS, not HDFS. > > 2) My understanding was that mapper output gets written to hdfs, since I've > seen part-m-00000 files in hdfs. If mapper output is written to HDFS then > shouldn't reducers simply read it from hdfs instead of making http calls to > tasktrackers location? > > Map output is sent to HDFS when reducer is not used. > > ----- from the book --- > Mapper > The output file’s partitions are made available to the reducers over HTTP. > The number of worker threads used to serve the file partitions is > controlled by the tasktracker.http.threads property > this setting is per tasktracker, not per map task slot. The default of 40 > may need increasing for large clusters running large jobs.6.4.2. > > The Reduce Side > Let’s turn now to the reduce part of the process. The map output file is > sitting on the local disk of the tasktracker that ran the map task > (note that although map outputs always get written to the local disk of the > map tasktracker, reduce outputs may not be), but now it is needed by the > tasktracker > that is about to run the reduce task for the partition. Furthermore, the > reduce task needs the map output for its particular partition from several > map tasks across the cluster. > The map tasks may finish at different times, so the reduce task starts > copying their outputs as soon as each completes. This is known as the copy > phase of the reduce task. > The reduce task has a small number of copier threads so that it can fetch > map outputs in parallel. > The default is five threads, but this number can be changed by setting the > mapred.reduce.parallel.copies property. >