Stefan Will wrote:
Hi,
I had a flaky machine the other day that was still accepting jobs and
sending heartbeats, but caused all reduce task attempts to fail. This in
turn caused the whole job to fail because the same reduce task was retried 3
times on that particular machine.
What is your cluster size? If a task fails on a machine, it is retried on
some other machine (based on the number of good machines left in the
cluster). After a certain number of failures, the machine will be
blacklisted (again based on the number of machines left in the cluster).
Three different reducers might be scheduled on that machine, but that
alone should not lead to job failure. Can you explain in detail what
exactly happened? Find out from the jobtracker's log where the attempts
got scheduled.
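For what it's worth, the thresholds involved can be tuned per job. A rough
sketch against the old org.apache.hadoop.mapred API (the class name and the
values below are just illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class RetryTuning {
  public static JobConf configure(Class<?> driver) {
    JobConf conf = new JobConf(driver);
    // How many times each reduce task may be attempted before the job
    // is failed (mapred.reduce.max.attempts, default 4).
    conf.setMaxReduceAttempts(4);
    // How many task failures one tasktracker may cause for this job
    // before it is blacklisted for the job
    // (mapred.max.tracker.failures, default 4). A lower value makes the
    // job give up on a flaky node sooner.
    conf.setMaxTaskFailuresPerTracker(1);
    return conf;
  }
}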
Amar
Perhaps I'm confusing this with the block placement strategy in HDFS, but I
always thought that the framework would retry tasks on a different machine if
attempts on the original machine keep failing. E.g. I would have expected it
to retry once or twice on the same machine, but then switch to a different one
to minimize the likelihood of getting stuck on a bad machine.
What is the expected behavior in 0.19.1 (which I'm running)? Are there any
plans for improving on this in the future?
Thanks,
Stefan