[ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721662#action_12721662 ]

Amar Kamat commented on HADOOP-6014:
------------------------------------

bq. Maybe as a first step, we can just treat the failures that were explicitly initiated by the TaskTracker differently, and not have the TaskTracker be penalized for those.

I think for now this will be a simple thing to do. A task can fail because of
# code issues (failure, e.g. buggy code)
# node issues (killed, e.g. a bad disk)
# mismatch (killed-failure, e.g. insufficient memory)
In case #3 it is not the TaskTracker's fault, so we should be less aggressive in counting such failures against it.

bq. I'd tend to agree with Jim that we should just use HADOOP-5478 and revert the cross-job blacklisting.

Cross-job blacklisting will still be required. Consider a case where a node's environment is messed up (all the basic apps, e.g. wc, sort etc., are missing). In such a case I don't think node health scripts will help. The number of task/job failures looks like the right metric to me.

> Improvements to Global Black-listing of TaskTrackers
> ----------------------------------------------------
>
>                 Key: HADOOP-6014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6014
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>             Fix For: 0.21.0
>
>
> HADOOP-4305 added a global black-list of tasktrackers.
> We saw a scenario on one of our clusters where a few jobs caused a lot of tasktrackers to immediately be blacklisted. This was caused by a specific set of jobs (from the same user) whose tasks were shot down by the TaskTracker for being over the vmem limit of 2G. Each of these jobs had over 600 failures of the same kind. This resulted in each of these jobs black-listing some tasktrackers, which in itself is wrong, since the failures (i.e. high memory usage) had nothing to do with the nodes on which they occurred and shouldn't have penalized the tasktrackers. We clearly need to start treating system and user failures separately for black-listing, etc. A DiskError is fatal and should probably lead to immediate blacklisting, while a task that was 'failed' for using too much memory shouldn't count against the tasktracker at all!
> The other problem is that we never configured mapred.max.tracker.blacklists and continue to use the default value of 4. Furthermore, this config should really be a percentage of the cluster size and not an absolute number.
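
To make the distinction concrete, here is a minimal Java sketch of a blacklist policy along the lines discussed above. It is not the actual JobTracker/TaskTracker implementation; the class, the method names, and the blacklistFraction parameter are hypothetical, and mapping failure kinds to penalties this way is just one possible reading of the comment.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch only -- not the real Hadoop code; all names are made up.
 * It illustrates the policy argued for above: count only node-attributable
 * failures against a tracker, and derive the global blacklist threshold from
 * the cluster size instead of the fixed mapred.max.tracker.blacklists
 * default of 4.
 */
public class BlacklistPolicySketch {

  /** The three failure categories from the comment. */
  enum FailureKind {
    USER_CODE,   // #1: buggy job code (failure) -- the job's fault
    NODE,        // #2: disk problems etc. (killed) -- the node's fault
    MISMATCH     // #3: e.g. insufficient memory -- job/node mismatch
  }

  private final Map<String, Integer> nodeFaults = new HashMap<>();
  private final double blacklistFraction;   // e.g. 0.1 => 10% of cluster size

  public BlacklistPolicySketch(double blacklistFraction) {
    this.blacklistFraction = blacklistFraction;
  }

  /** Record a task failure; only category #2 counts against the tracker. */
  public void recordFailure(String trackerName, FailureKind kind) {
    if (kind == FailureKind.NODE) {
      nodeFaults.merge(trackerName, 1, Integer::sum);
    }
    // USER_CODE and MISMATCH still fail the task attempt, but the tracker
    // is not penalized here (it could instead be penalized with a smaller
    // weight, per the "less aggressive" suggestion for case #3).
  }

  /** Threshold derived from cluster size rather than a fixed whole number. */
  public boolean shouldBlacklist(String trackerName, int clusterSize) {
    int threshold = Math.max(1, (int) Math.ceil(blacklistFraction * clusterSize));
    return nodeFaults.getOrDefault(trackerName, 0) >= threshold;
  }
}
{code}

A caller would have to classify each reported failure into one of the three kinds before feeding it to recordFailure(); how reliably that classification can be made is essentially what the discussion above is about.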