[ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721662#action_12721662 ]

Amar Kamat commented on HADOOP-6014:
------------------------------------

bq. Maybe as a first step, we can just treat the failures that were explicitly 
initiated by the TaskTracker differently, and not have the TaskTracker be 
penalized for those. 
I think for now this will be a simple thing to do. A task can fail because of 
# code issues (failure, e.g. buggy user code)
# node issues (killed, e.g. a bad disk)
# mismatch (killed-failure, e.g. insufficient memory)

In case #3, it's not the TaskTracker's fault, and hence we should be less 
aggressive when counting such failures toward blacklisting.
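A minimal sketch of such a policy, only to make the idea concrete (the names {{FailureCause}} and {{shouldCountAgainstTracker}} are hypothetical, not existing Hadoop API):

{code:java}
// Illustrative only: classify why a task ended and decide whether the
// failure should count against the TaskTracker. All names here are
// assumptions for the sketch, not the actual Hadoop implementation.
public class TrackerFaultPolicy {

  enum FailureCause {
    CODE,     // case #1: failed, e.g. buggy user code -- the job's fault
    NODE,     // case #2: killed, e.g. a bad disk -- genuinely the node's fault
    MISMATCH  // case #3: killed-failure, e.g. insufficient memory for the task
  }

  /** Only node-level problems add to the tracker's fault count. */
  boolean shouldCountAgainstTracker(FailureCause cause) {
    switch (cause) {
      case NODE:
        return true;   // penalize the tracker
      case CODE:
      case MISMATCH:
      default:
        return false;  // job/user problem; be less aggressive here
    }
  }
}
{code}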

bq. I'd tend to agree with Jim that we should just use HADOOP-5478 and revert 
the cross-job blacklisting.
Cross-job blacklisting will still be required. Consider a case where a node's 
environment is messed up (all the basic apps, e.g. wc, sort, etc., are 
missing). In such a case I don't think node health-check scripts will help. 
The number of task/job failures looks like the right metric to me. 
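To illustrate that metric, a minimal sketch of counting, per tracker, how many distinct jobs have blacklisted it (the class and field names are assumptions; only the {{mapred.max.tracker.blacklists}} knob is real):

{code:java}
// Illustrative only: globally blacklist a tracker once enough distinct
// jobs have individually blacklisted it. Not the JobTracker's actual code.
import java.util.HashMap;
import java.util.Map;

public class CrossJobBlacklist {
  private final Map<String, Integer> jobBlacklistCounts =
      new HashMap<String, Integer>();
  private final int maxTrackerBlacklists; // cf. mapred.max.tracker.blacklists

  public CrossJobBlacklist(int maxTrackerBlacklists) {
    this.maxTrackerBlacklists = maxTrackerBlacklists;
  }

  /** Called whenever a single job blacklists this tracker. */
  public void jobBlacklistedTracker(String trackerName) {
    Integer count = jobBlacklistCounts.get(trackerName);
    jobBlacklistCounts.put(trackerName, count == null ? 1 : count + 1);
  }

  /** True once the cross-job failure count crosses the threshold. */
  public boolean isGloballyBlacklisted(String trackerName) {
    Integer count = jobBlacklistCounts.get(trackerName);
    return count != null && count >= maxTrackerBlacklists;
  }
}
{code}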

> Improvements to Global Black-listing of TaskTrackers
> ----------------------------------------------------
>
>                 Key: HADOOP-6014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6014
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>             Fix For: 0.21.0
>
>
> HADOOP-4305 added a global black-list of tasktrackers.
> We saw a scenario on one of our clusters where a few jobs caused a lot of 
> tasktrackers to be blacklisted immediately. This was caused by a specific set 
> of jobs (from the same user) whose tasks were shot down by the TaskTracker 
> for being over the vmem limit of 2G. Each of these jobs had over 600 failures 
> of the same kind. This resulted in each of these jobs black-listing some 
> tasktrackers, which in itself is wrong since the failures had nothing to do 
> with the node on which they occurred (i.e. they were due to high memory 
> usage) and shouldn't have penalized the tasktracker. We clearly need to start 
> treating system and user failures separately for black-listing etc. A 
> DiskError is fatal and should probably lead to immediate blacklisting, while 
> a task which was 'failed' for using too much memory shouldn't count against 
> the tasktracker at all!
> The other problem is that we never configured mapred.max.tracker.blacklists 
> and continue to use the default value of 4. Furthermore, this config should 
> really be a percentage of the cluster size rather than a whole number. 
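A minimal sketch of the percent-of-cluster-size idea from the description above; the helper and its semantics are assumptions, not a proposed patch:

{code:java}
// Illustrative only: derive the blacklist threshold from a percentage of
// the current cluster size instead of a fixed whole number like 4.
public final class BlacklistThreshold {
  private BlacklistThreshold() {}

  /** e.g. 1% of a 2000-node cluster -> threshold of 20. */
  static int maxTrackerBlacklists(int clusterSize, double percent) {
    // Never let the threshold drop below 1, even on tiny clusters.
    return Math.max(1, (int) Math.round(clusterSize * percent / 100.0));
  }

  public static void main(String[] args) {
    System.out.println(maxTrackerBlacklists(2000, 1.0)); // prints 20
  }
}
{code}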

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
