On Dec 23, 2010, at 9:20 PM, pig wrote:
For some special reduce jobs that do not rely on the order of (key, value) pairs, the sort phase is of no use. In that situation, theoretically speaking, the reduces could start before all of the map tasks have finished. Why doesn't Hadoop support this? For example, it could be specified as an argument when submitting a job.


Several reasons...

A major problem is errors: a map may fail after its output has been 'shuffled' (i.e. copied) by some reduces but not all of them. If the reduces have already started consuming that output, it's really hard to track and discard the duplicate key/value pairs produced when the map is re-run.

The behaviour you seek is quite easy to model by running map-only jobs, saving their output to HDFS, and processing it in a subsequent job (sketched below), albeit with some performance penalty. But this keeps the MR framework very simple and stable.
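For reference, here's a minimal sketch of that map-only pattern against the org.apache.hadoop.mapreduce API; the pass-through mapper, class names, and input/output paths are placeholders. Setting the number of reduces to zero drops the sort/shuffle phase entirely, so each map task writes its output straight to HDFS.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

  // Placeholder mapper that just passes records through;
  // substitute your own map logic here.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only stage");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Zero reduces: no sort/shuffle; map output goes directly to HDFS,
    // where the next job can pick it up as its input.
    job.setNumReduceTasks(0);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A second job then reads the first job's output directory as its input, which is where the "performance penalty" comes from: the intermediate data makes a round trip through HDFS instead of the shuffle.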

Arun
