[ http://issues.apache.org/jira/browse/HADOOP-654?page=comments#action_12454502 ]

Owen O'Malley commented on HADOOP-654:
--------------------------------------
The number of failures on each task tracker is already shown in the web UI. I propose a very simple mechanism where each job can configure the maximum number of task failures that it will tolerate from a single node.

JobConf gets:

  void setMaxTaskFailuresPerHost(int newValue);
  int getMaxTaskFailuresPerHost();

JobInProgress keeps a Map from task tracker id to a count of failures. Each task failure increments the count for the responsible tracker. (Should a lost task tracker count as one failure or as many failures?) When a tracker asks for new tasks, it is not given one if it has already failed too many times.

Note that these counts (and penalty boxes) are per job.

Does that sound like a reasonable starting point?

> jobs fail with some hardware/system failures on a small number of nodes
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-654
>                 URL: http://issues.apache.org/jira/browse/HADOOP-654
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.7.2
>            Reporter: Yoram Arnon
>         Assigned To: Owen O'Malley
>            Priority: Minor
>
> occasionally, such as when the OS is out of some resource, a node fails only
> partly. The node is up and running, the task tracker is running and sending
> heartbeats, but every task fails because the tasktracker can't fork tasks or
> something.
> In these cases, that task tracker keeps getting assigned tasks to execute,
> and they all fail.
> A couple of nodes like that and jobs start failing badly.
> The job tracker should avoid assigning tasks to tasktrackers that are
> misbehaving.
> simple approach: avoid tasktrackers that report many more failures than
> average (say 3X). Simply use the info sent by the TT.
> better but harder: track TT failures over time and:
> 1. avoid those that exhibit a high failure *rate*
> 2. tell them to shut down
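
For discussion, the per-job bookkeeping proposed in the comment could look roughly like the sketch below. It is not the Hadoop implementation: the class name PerJobFailureCounts and its methods are hypothetical, tracker ids are assumed to be plain strings, and it assumes a tracker is cut off as soon as its failure count reaches the configured limit. How a lost task tracker should be counted is left open, as in the comment.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper, not part of Hadoop: failure counts for ONE job,
 * keyed by task tracker id, with a configurable per-tracker limit.
 */
final class PerJobFailureCounts {
    // Assumed to come from something like JobConf.getMaxTaskFailuresPerHost().
    private final int maxTaskFailuresPerHost;
    private final Map<String, Integer> failuresByTracker = new HashMap<>();

    PerJobFailureCounts(int maxTaskFailuresPerHost) {
        if (maxTaskFailuresPerHost < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxTaskFailuresPerHost = maxTaskFailuresPerHost;
    }

    /** Record one failed task attributed to the given tracker. */
    void recordTaskFailure(String trackerId) {
        failuresByTracker.merge(trackerId, 1, Integer::sum);
    }

    /** Number of failures recorded so far for this tracker (0 if none). */
    int getFailures(String trackerId) {
        return failuresByTracker.getOrDefault(trackerId, 0);
    }

    /**
     * Assumption: a tracker stays eligible only while its count is below
     * the limit; once it reaches the limit, this job gives it no new tasks.
     */
    boolean mayAssignTo(String trackerId) {
        return getFailures(trackerId) < maxTaskFailuresPerHost;
    }
}
```

Because each job would own its own instance, a tracker that is in the penalty box for one job can still receive tasks from other jobs, which matches the "per job" note in the comment.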