Fix the per-job tasktracker 'blacklist'
---------------------------------------

                 Key: HADOOP-1278
                 URL: https://issues.apache.org/jira/browse/HADOOP-1278
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
            Reporter: Arun C Murthy
         Assigned To: Arun C Murthy
             Fix For: 0.13.0


Today, whenever a tracker is 'lost', all the tasks which ever ran on it are 
counted as failures and the tracker is added to the blacklist, which ensures 
that the particular tasktracker (TT) is *never* considered for allocating new 
tasks unless *all* tasktrackers are on the list. This results in an ugly 
situation where a majority of nodes in the cluster are on the blacklist and 
hence idle, while the other TTs are maxed out.

The proposal is two-fold:
a) Don't count *all* the tasks which ever ran on the TT; count a 'lost' tracker 
as a 'single' task failure, which means that each loss uses up 20% of the 
'5 failures == blacklisted' quota.
b) Stop adding nodes to the blacklist once a certain percentage of the cluster, 
say 25%, is already on it. Adding more than that would just delay the 
inevitable, i.e. there is something horrendously wrong with the cluster, and we 
might as well fail the job early and noisily.
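
To make (a) and (b) concrete, here is a rough sketch of how the two rules could 
fit together. It is not the existing mapred code: the class JobBlacklistSketch 
and all of its method names are hypothetical, the quota of 5 and the 25% cap 
are the figures mentioned above, and treating "a tracker hits its quota while 
the cap is already reached" as the trigger for failing the job is one possible 
reading of (b).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-job blacklist; names are illustrative, not Hadoop's API. */
final class JobBlacklistSketch {
    // The '5 failures == blacklisted' quota from the proposal.
    static final int FAILURES_TO_BLACKLIST = 5;
    // The "say 25%" cap from (b); assumed to be a fraction of the cluster size.
    static final double MAX_BLACKLIST_FRACTION = 0.25;

    private final int clusterSize;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();
    private boolean jobShouldFail = false;

    JobBlacklistSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** (a) A lost tracker counts as ONE failure, however many tasks ran on it. */
    void trackerLost(String tracker) {
        taskFailed(tracker);
    }

    /** Records one failure and applies the quota and the cap. */
    void taskFailed(String tracker) {
        int count = failures.merge(tracker, 1, Integer::sum);
        if (count < FAILURES_TO_BLACKLIST || blacklist.contains(tracker)) {
            return;
        }
        if (capReached()) {
            // (b) Assumed policy: stop blacklisting and flag the job to fail early.
            jobShouldFail = true;
        } else {
            blacklist.add(tracker);
        }
    }

    /** True once the blacklisted share of the cluster has reached the cap. */
    boolean capReached() {
        return blacklist.size() >= MAX_BLACKLIST_FRACTION * clusterSize;
    }

    boolean isBlacklisted(String tracker) {
        return blacklist.contains(tracker);
    }

    boolean jobShouldFail() {
        return jobShouldFail;
    }

    public static void main(String[] args) {
        JobBlacklistSketch job = new JobBlacklistSketch(8);
        for (int i = 0; i < FAILURES_TO_BLACKLIST; i++) {
            job.trackerLost("tt1");
        }
        System.out.println("tt1 blacklisted: " + job.isBlacklisted("tt1"));
        System.out.println("job should fail: " + job.jobShouldFail());
    }
}
```

With this shape a single lost tracker costs one fifth of the quota instead of 
blacklisting the node outright, and the cap check sits in one place so the 
percentage can be tuned or made configurable.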

Thoughts?
