Stefan Will wrote:
Hi,
I had a flaky machine the other day that was still accepting jobs and
sending heartbeats, but caused all reduce task attempts to fail. This in
turn caused the whole job to fail because the same reduce task was retried 3
times on that particular machine.
What is your cluster size? If a task fails on a machine, it is retried on
some other machine (based on the number of good machines left in the
cluster). After a certain number of failures, the machine will be
blacklisted (again based on the number of machines left in the cluster).
Three different reducers might be scheduled on that machine, but that
alone should not lead to job failure. Can you explain in detail what
exactly happened? Find out from the jobtracker's log where the attempts
got scheduled.
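For what it's worth, the thresholds involved can be tuned per job. A rough
sketch against the old org.apache.hadoop.mapred API (the class name and the
values below are just illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class RetryTuning {
  public static JobConf configure(Class<?> driver) {
    JobConf conf = new JobConf(driver);
    // How many times each reduce task may be attempted before the job
    // is failed (mapred.reduce.max.attempts, default 4).
    conf.setMaxReduceAttempts(4);
    // How many task failures one tasktracker may cause for this job
    // before it is blacklisted for the job
    // (mapred.max.tracker.failures, default 4). A lower value makes the
    // job give up on a flaky node sooner.
    conf.setMaxTaskFailuresPerTracker(1);
    return conf;
  }
}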
Amar
Perhaps I'm confusing this with the block placement strategy in HDFS, but I
always thought that the framework would retry tasks on a different machine if
attempts on the original machine keep failing. E.g. I would have expected it
to retry once or twice on the same machine, but then switch to a different one
to minimize the likelihood of getting stuck on a bad machine.
What is the expected behavior in 0.19.1 (which I'm running)? Are there any
plans for improving on this in the future?
Thanks,
Stefan