Fix the per-job tasktracker 'blacklist'
---------------------------------------
Key: HADOOP-1278
URL: https://issues.apache.org/jira/browse/HADOOP-1278
Project: Hadoop
Issue Type: Bug
Components: mapred
Reporter: Arun C Murthy
Assigned To: Arun C Murthy
Fix For: 0.13.0

Today, whenever a tracker is 'lost', all the tasks which ever ran on it are
counted as failures against it, and the tracker is added to the blacklist. This
ensures that the particular TT is *never* considered for allocating new tasks
unless *all* tasktrackers are on the list. The result is an ugly situation
where a majority of nodes in the cluster are on the blacklist and hence idle,
while the other TTs are maxed out.

The proposal is two-fold:
a) Don't count *all* tasks which ever ran on the TT; count the loss as a
'single' task failure instead. Each 'lost' tracker then uses up 20% of the
'5 failures == blacklisted' quota.
b) Stop adding nodes to the blacklist once a certain percentage of the cluster,
say 25%, is already on the blacklist. Adding more than that would just delay
the inevitable, i.e. there is something horrendously wrong with the cluster, so
we might as well fail the job early and noisily.
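
To make the two parts concrete, here is a rough sketch of the bookkeeping they
imply. It is only an illustration under stated assumptions, not the actual
JobTracker/JobInProgress code: the class JobBlacklistSketch and all of its
method names are hypothetical, the threshold of 5 failures and the 25% cap are
the example figures from above, and what happens once the cap is hit (e.g.
failing the job) is left to the caller.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-job blacklist bookkeeping; not Hadoop's real implementation. */
class JobBlacklistSketch {
    private final int clusterSize;             // number of tasktrackers in the cluster
    private final int failuresToBlacklist;     // e.g. 5, as in '5 failures == blacklisted'
    private final double maxBlacklistFraction; // e.g. 0.25, the cap from part (b)

    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();

    JobBlacklistSketch(int clusterSize, int failuresToBlacklist, double maxBlacklistFraction) {
        this.clusterSize = clusterSize;
        this.failuresToBlacklist = failuresToBlacklist;
        this.maxBlacklistFraction = maxBlacklistFraction;
    }

    /** Part (a): a lost tracker counts as ONE failure, however many tasks ran on it. */
    void trackerLost(String tracker) {
        recordFailure(tracker);
    }

    /** An ordinary task failure on a live tracker also counts as one failure. */
    void taskFailed(String tracker) {
        recordFailure(tracker);
    }

    private void recordFailure(String tracker) {
        int count = failures.merge(tracker, 1, Integer::sum);
        // Part (b): only blacklist while the cap has not been reached.
        if (count >= failuresToBlacklist && !isCapReached()) {
            blacklist.add(tracker);
        }
    }

    /** True once the blacklist holds the configured fraction of the cluster. */
    boolean isCapReached() {
        return blacklist.size() >= maxBlacklistFraction * clusterSize;
    }

    boolean isBlacklisted(String tracker) {
        return blacklist.contains(tracker);
    }

    int failureCount(String tracker) {
        return failures.getOrDefault(tracker, 0);
    }
}
```

With an 8-node cluster, a lost tracker plus four ordinary task failures puts
that tracker on the list, and after two trackers are listed the 25% cap stops
further additions; at that point the caller could fail the job noisily.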
Thoughts?