On Nov 6, 2008, at 11:29 AM, Ricky Ho wrote:

Disk I/O overhead
==================
- The output of a map task is written to a local disk and later fetched by the reduce tasks. While this enables a simple recovery strategy when a map task fails, it incurs additional disk I/O overhead.

That is correct. However, Linux does very well at using spare RAM for the buffer cache, so as long as you enable write-behind it won't be a performance problem. You are right that the primary motivations are recoverability and not needing the reduces to run until after the maps finish.

So I am wondering if there is an option to bypass the step of writing the map result to the local disk.

Currently no.

- In the current setting, it sounds like no reduce task will start before all map tasks have completed. If there are a few slow-running map tasks, the whole job will be delayed.

The application's reduce function can't start until the last map finishes because the input to the reduce is sorted. Since the last map may generate the first keys that must be given to the reduce, the reduce must wait.
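A small sketch (plain Python with made-up data, not Hadoop code) of why the sort forces this wait: the globally smallest key may come from whichever map finishes last, so the merged, sorted stream handed to reduce cannot begin until every map's output is available.

```python
import heapq

# Sorted output of each map task as (key, value) pairs. The job's
# smallest key, "aardvark", happens to come from the slowest map.
map_outputs = [
    [("banana", 1), ("cherry", 1)],    # map 0, finishes early
    [("cherry", 1), ("date", 1)],      # map 1, finishes early
    [("aardvark", 1), ("banana", 1)],  # map 2, the straggler
]

# The reduce-side merge of all sorted map outputs. Its very first
# record comes from map 2, so the application's reduce function
# cannot be invoked before map 2 has finished.
merged = list(heapq.merge(*map_outputs))
print(merged[0])  # ('aardvark', 1) -- produced by the slowest map
```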

- The overall job execution could be shortened if the reduce tasks could start processing as soon as some map results are available, rather than waiting for all the map tasks to complete.

But that would violate the framework's specification that the input to reduce is completely sorted.

- Therefore it is impossible for the reduce phase of Job1 to stream its output to a file while the map phase of Job2 starts reading the same file. Job2 can only start after ALL REDUCE TASKS of Job1 are complete, which makes pipelining between jobs impossible.

It is currently not supported, but the framework could be extended to let the client add input splits after the job has started. That would remove the hard synchronization between jobs.

- This means the partitioning function has to be chosen carefully to ensure the workload of the reduce tasks is balanced. (Maybe not a big deal.)

Yes, the partitioner must balance the workload between the reduces.
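As an illustration (plain Python, not the Hadoop Partitioner API): hash-style partitioning in the spirit of Hadoop's default balances well when there are many distinct keys, but it cannot help when one key dominates the input, since every record for a given key must land on the same reduce.

```python
from collections import Counter

def partition(key, num_reduces):
    # Hash partitioning in the style of Hadoop's default HashPartitioner:
    # mask off the sign bit, then take the remainder.
    return (hash(key) & 0x7FFFFFFF) % num_reduces

num_reduces = 4

# Many distinct keys: each reduce gets roughly 1/4 of the records.
keys = [f"key-{i}" for i in range(10_000)]
load = Counter(partition(k, num_reduces) for k in keys)
print(load)

# One dominant key: all of its records go to a single reduce, no
# matter how the partitioner is written.
skewed = ["the"] * 10_000
skewed_load = Counter(partition(k, num_reduces) for k in skewed)
print(skewed_load)  # one partition holds everything
```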

- Are there any thoughts on running a pool of reduce tasks on the same key and having them combine their results later?

That is called the combiner. It is invoked multiple times as the data is merged together; see the word count example. If the reducer performs data reduction, using combiners is very important for performance.
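A toy word count in plain Python (not the Hadoop API) showing the effect: the combiner pre-sums each map task's output, so fewer (word, count) records need to be shuffled to the reduces, while the final totals are unchanged.

```python
from collections import Counter
from itertools import chain

def map_task(line):
    # Map: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Combiner: same shape as the reducer -- sum counts that share a
    # key, but only within a single map task's output.
    summed = Counter()
    for word, count in pairs:
        summed[word] += count
    return list(summed.items())

def reduce_all(pairs):
    # Reducer: sum counts across all map outputs.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["the cat and the hat", "the dog and the cat"]
raw = [map_task(line) for line in lines]

without_combiner = list(chain.from_iterable(raw))
with_combiner = list(chain.from_iterable(combine(m) for m in raw))

# The combiner shrinks what must be shuffled to the reduces...
print(len(without_combiner), len(with_combiner))  # 10 8
# ...without changing the final answer.
print(reduce_all(without_combiner) == reduce_all(with_combiner))  # True
```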

-- Owen
