[jira] Created: (HADOOP-654) jobs fail with some hardware/system failures on a small number of nodes

Yoram Arnon (JIRA) Mon, 30 Oct 2006 10:49:45 -0800

jobs fail with some hardware/system failures on a small number of nodes
-----------------------------------------------------------------------


                 Key: HADOOP-654
                 URL: http://issues.apache.org/jira/browse/HADOOP-654
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.7.2
            Reporter: Yoram Arnon
         Assigned To: Owen O'Malley
            Priority: Minor


Occasionally, for example when the OS runs out of some resource, a node fails 
only partly. The node is up and the tasktracker is running and sending 
heartbeats, but every task fails, presumably because the tasktracker can't 
fork task processes or for a similar reason.
In these cases, that tasktracker keeps getting assigned tasks to execute, and 
they all fail.
With even a couple of nodes in that state, jobs start failing badly.

The job tracker should avoid assigning tasks to tasktrackers that are 
misbehaving.

Simple approach: avoid tasktrackers that report many more failures than average 
(say 3X), using only the failure info already sent by the tasktracker (TT).
Better but harder: track TT failures over time and:
 1. avoid those that exhibit a high failure *rate*
 2. tell them to shut down
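
The simple approach above could be sketched roughly as follows. This is only 
an illustration of the "3X the average" heuristic, not actual JobTracker code: 
the class TrackerFailureFilter, the method trackersToAvoid, and the map of 
per-tracker failure counts are all hypothetical names, and the sketch assumes 
the job tracker already has a failure count for each tasktracker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Hadoop's JobTracker API.
class TrackerFailureFilter {
    // Threshold suggested in the report: "many more failures than average (say 3X)".
    static final double FAILURE_FACTOR = 3.0;

    /**
     * Returns the names of tasktrackers whose reported failure count is
     * strictly greater than FAILURE_FACTOR times the mean over all trackers.
     */
    static List<String> trackersToAvoid(Map<String, Integer> failuresByTracker) {
        List<String> avoid = new ArrayList<>();
        if (failuresByTracker.isEmpty()) {
            return avoid;
        }
        long total = 0;
        for (int count : failuresByTracker.values()) {
            total += count;
        }
        double average = (double) total / failuresByTracker.size();
        for (Map.Entry<String, Integer> entry : failuresByTracker.entrySet()) {
            if (entry.getValue() > FAILURE_FACTOR * average) {
                avoid.add(entry.getKey());
            }
        }
        return avoid;
    }

    public static void main(String[] args) {
        // Made-up failure counts: four healthy trackers and one misbehaving one.
        Map<String, Integer> failures = new TreeMap<>();
        failures.put("tt1", 1);
        failures.put("tt2", 2);
        failures.put("tt3", 0);
        failures.put("tt4", 1);
        failures.put("tt5", 40);
        System.out.println(trackersToAvoid(failures));
    }
}
```

One caveat with this heuristic: the misbehaving tracker's own failures raise 
the average, so on a cluster with three or fewer tasktrackers a single bad 
node can never exceed 3X the mean. That is one reason the rate-based approach 
may be the more robust option.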

