On Dec 23, 2010, at 9:20 PM, pig wrote:
For some special reduce jobs that do not rely on the order of (key, value) pairs, the sort phase is of no use. In that situation, theoretically speaking, the reduces could start before all of the map tasks have finished. Why doesn't Hadoop support this? For example, it could be specified as an argument when submitting a job.


Several reasons...

A major problem is errors: a map may fail after its output has been 'shuffled' (i.e. copied) by some reduces but not all of them. If the reduces have already started consuming that output, it's really hard to track and discard the duplicate key/value pairs produced when the map is re-run.

The behaviour you seek is quite easy to model by running map-only jobs, saving their output to HDFS, and processing it in a subsequent job (sketched below), albeit with some performance penalty. But this keeps the MR framework very simple and stable.
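For reference, here's a minimal sketch of that map-only pattern against the org.apache.hadoop.mapreduce API; the pass-through mapper, class names, and input/output paths are placeholders. Setting the number of reduces to zero drops the sort/shuffle phase entirely, so each map task writes its output straight to HDFS.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

  // Placeholder mapper that just passes records through;
  // substitute your own map logic here.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only stage");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Zero reduces: no sort/shuffle; map output goes directly to HDFS,
    // where the next job can pick it up as its input.
    job.setNumReduceTasks(0);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A second job then reads the first job's output directory as its input, which is where the "performance penalty" comes from: the intermediate data makes a round trip through HDFS instead of the shuffle.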

Arun
