One bad node can cause whole job to fail
----------------------------------------
Key: HADOOP-5547
URL: https://issues.apache.org/jira/browse/HADOOP-5547
Project: Hadoop Core
Issue Type: Bug
Reporter: Nathan Marz
This happened on the 0.19.2 branch. One of the nodes in our cluster was having
disk problems and every task run on it was failing. In general the node would
get blacklisted and jobs would run fine on it. However, for one job, the job
ran the "Job setup" task on this bad node. When the task failed, the task was
then retried on the same bad node 3 more times until the job failed. Hadoop
should be able to handle this situation better.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.