Improvements to Global Black-listing of TaskTrackers
----------------------------------------------------

                 Key: HADOOP-6014
                 URL: https://issues.apache.org/jira/browse/HADOOP-6014
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.20.0
            Reporter: Arun C Murthy
             Fix For: 0.21.0


HADOOP-4305 added a global black-list of tasktrackers.

We saw a scenario on one of our clusters where a few jobs caused a lot of 
tasktrackers to be blacklisted immediately. This was caused by a specific set 
of jobs (from the same user) whose tasks were shot down by the TaskTracker for 
being over the vmem limit of 2G. Each of these jobs had over 600 failures of 
the same kind. This resulted in each of these jobs black-listing some 
tasktrackers, which in itself is wrong: the failures were caused by high 
memory usage, had nothing to do with the node on which they occurred, and 
shouldn't have penalized the tasktracker. We clearly need to start treating 
system and user failures separately for black-listing etc. A DiskError is 
fatal and the tasktracker should probably be blacklisted immediately, while a 
task which was 'failed' for using more memory shouldn't count against the 
tasktracker at all!
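
To make the proposed split concrete, here is a minimal sketch in Java. All 
names in it (FailureKind, TrackerFaultCounter, OTHER_SYSTEM_FAULT, maxFaults) 
are hypothetical and are not existing Hadoop classes or settings. It only 
illustrates the idea: user-caused failures are ignored, ordinary system faults 
are counted, and a fatal fault blacklists the tracker at once.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical classification of task failures; not a Hadoop type.
enum FailureKind {
    DISK_ERROR(true, true),               // system fault, fatal
    OTHER_SYSTEM_FAULT(true, false),      // system fault, counted (illustrative)
    MEMORY_LIMIT_EXCEEDED(false, false);  // user fault, never counted

    final boolean countsAgainstTracker;
    final boolean fatal;

    FailureKind(boolean countsAgainstTracker, boolean fatal) {
        this.countsAgainstTracker = countsAgainstTracker;
        this.fatal = fatal;
    }
}

// Hypothetical per-tracker bookkeeping; not a Hadoop type.
class TrackerFaultCounter {
    private final int maxFaults;  // assumed threshold for non-fatal faults
    private final Map<String, Integer> faults = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    TrackerFaultCounter(int maxFaults) {
        this.maxFaults = maxFaults;
    }

    // Records one failure and returns whether the tracker is now blacklisted.
    boolean recordFailure(String tracker, FailureKind kind) {
        if (kind.countsAgainstTracker) {
            int count = faults.merge(tracker, 1, Integer::sum);
            if (kind.fatal || count >= maxFaults) {
                blacklisted.add(tracker);
            }
        }
        return blacklisted.contains(tracker);
    }

    boolean isBlacklisted(String tracker) {
        return blacklisted.contains(tracker);
    }
}
```

With this shape, 600 MEMORY_LIMIT_EXCEEDED failures on one tracker leave it 
usable, while a single DISK_ERROR blacklists it.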

The other problem is that we never configured mapred.max.tracker.blacklists and 
continue to use the default value of 4. Furthermore, this config should really 
be a percentage of the cluster size and not a whole number.
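
As a rough illustration of a percentage-based setting, the sketch below 
converts a percentage of the cluster size into a whole-number limit. The class 
and method names (BlacklistThreshold, fromPercent) are hypothetical, and 
rounding up is an assumption made here, not something this issue specifies.

```java
// Hypothetical helper; not part of Hadoop.
final class BlacklistThreshold {
    private BlacklistThreshold() {}

    // Converts a percentage of the cluster size into a whole-number limit.
    // Rounds up (an assumption) so that any non-zero percentage on a
    // non-empty cluster yields a limit of at least 1.
    static int fromPercent(int clusterSize, double percent) {
        if (clusterSize < 0 || percent < 0.0 || percent > 100.0) {
            throw new IllegalArgumentException("invalid cluster size or percent");
        }
        return (int) Math.ceil(clusterSize * percent / 100.0);
    }
}
```

Under these assumptions, 2 percent of a 200-node cluster gives a limit of 4, 
matching the current default, and the limit grows with the cluster instead of 
staying fixed.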

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
