[ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721701#action_12721701 ]
Amareshwari Sriramadasu commented on HADOOP-6014:
-------------------------------------------------

bq. But it is very hard to draw the inference the other way when you may have a run of "bad" jobs that are all expected to fail.

The current blacklisting strategy looks at trackers blacklisted by successful jobs. Also, a TT gets blacklisted only if the #blacklists for the tracker is 50% above the average #blacklists over the active and potentially faulty trackers.

> Improvements to Global Black-listing of TaskTrackers
> ----------------------------------------------------
>
>                 Key: HADOOP-6014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6014
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>             Fix For: 0.21.0
>
>
> HADOOP-4305 added a global black-list of tasktrackers.
> We saw a scenario on one of our clusters where a few jobs caused a lot of
> tasktrackers to immediately be blacklisted. This was caused by a specific set
> of jobs (same user) whose tasks were shot down by the TaskTracker for being
> over the vmem limit of 2G. Each of these jobs had over 600 failures of the
> same kind. This resulted in each of these jobs black-listing some
> tasktrackers, which in itself is wrong since the failures (i.e. high memory
> usage) had nothing to do with the node on which they occurred and shouldn't
> have penalized the tasktracker. We clearly need to start treating system and
> user failures separately for black-listing etc. A DiskError is fatal and the
> tasktracker should probably be blacklisted immediately, while a task which
> was 'failed' for using more memory shouldn't count against the tasktracker
> at all!
> The other problem is that we never configured mapred.max.tracker.blacklists
> and continue to use the default value of 4. Furthermore, this config should
> really be a percent of the cluster-size and not a whole number.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
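
The rule described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the JobTracker code: the class BlacklistSketch, its method names, and the choice to average over every tracker passed in (taken here to be the active and potentially faulty trackers) are all hypothetical. The minBlacklists parameter stands in for mapred.max.tracker.blacklists, and thresholdFromPercent shows one possible reading of the "percent of the cluster-size" suggestion from the issue description.

```java
import java.util.Map;

/** Hypothetical sketch of the blacklisting heuristic; not Hadoop's implementation. */
final class BlacklistSketch {

    /** "50% above the average", expressed as a multiplier on the average. */
    static final double ABOVE_AVERAGE_FACTOR = 1.5;

    /**
     * Average #blacklists over the trackers in the map. Assumption: the caller
     * passes in exactly the active and potentially faulty trackers.
     */
    static double averageBlacklists(Map<String, Integer> blacklistsByTracker) {
        if (blacklistsByTracker.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        for (int count : blacklistsByTracker.values()) {
            total += count;
        }
        return (double) total / blacklistsByTracker.size();
    }

    /**
     * A tracker is blacklisted only if it has reached the minimum count AND
     * its count is more than 50% above the average.
     */
    static boolean shouldBlacklist(String tracker,
                                   Map<String, Integer> blacklistsByTracker,
                                   int minBlacklists) {
        int count = blacklistsByTracker.getOrDefault(tracker, 0);
        if (count < minBlacklists) {
            return false;
        }
        return count > ABOVE_AVERAGE_FACTOR * averageBlacklists(blacklistsByTracker);
    }

    /** Hypothetical: derive the minimum count from a percent of the cluster size. */
    static int thresholdFromPercent(int clusterSize, double percent) {
        return Math.max(1, (int) Math.ceil(clusterSize * percent / 100.0));
    }
}
```

With counts of 8, 2, 2 and 0 across four trackers, the average is 3, so only the tracker with 8 blacklists exceeds 1.5 x 3 = 4.5 and is blacklisted. If a run of "bad" jobs pushes every tracker to the same count, no tracker is above the average and none is blacklisted under this sketch.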