[ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721647#action_12721647 ]
Owen O'Malley commented on HADOOP-6014:
---------------------------------------

I guess I'll go further and say that user jobs can demonstrate that a node is healthy, as long as we are willing to run tasks on it. But it is very hard to draw the inference the other way when you may have a run of "bad" jobs that are all expected to fail. About all that we could do is notice that they are failing on all/many of the nodes and thus weaken their contribution to the node badness measure.

> Improvements to Global Black-listing of TaskTrackers
> ----------------------------------------------------
>
>                 Key: HADOOP-6014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6014
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>             Fix For: 0.21.0
>
>
> HADOOP-4305 added a global black-list of tasktrackers.
> We saw a scenario on one of our clusters where a few jobs caused a lot of
> tasktrackers to immediately be blacklisted. This was caused by a specific set
> of jobs (from the same user) whose tasks were shot down by the TaskTracker
> for being over the vmem limit of 2G. Each of these jobs had over 600 failures
> of the same kind. This resulted in each of the users black-listing some
> tasktrackers, which in itself is wrong since the failures had nothing to do
> with the node on which the failure occurred (i.e. high memory usage) and
> shouldn't have penalized the tasktracker. We clearly need to start
> treating system and user failures separately for black-listing etc. A
> DiskError is fatal and should probably be blacklisted immediately, while a
> task which was 'failed' for using more memory shouldn't count against the
> tasktracker at all!
> The other problem is that we never configured mapred.max.tracker.blacklists
> and continue to use the default value of 4. Furthermore, this config should
> really be a percentage of the cluster size and not a whole number.
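
To make the weighting idea in the comment concrete, here is a rough sketch. It is not Hadoop code and not a proposed patch: the function name, the three failure kinds ("system", "user", "task") and the linear down-weighting are all assumptions chosen for illustration. A system failure such as a DiskError counts fully against the node, a failure known to be the user's fault (e.g. exceeding the memory limit) counts for nothing, and any other task failure is discounted by the fraction of the cluster on which the same job has failed.

```python
from collections import defaultdict


def node_badness(failures, cluster_size):
    """Return {node: badness score} for a list of failure records.

    failures: list of (job_id, node, kind) tuples, where kind is one of
              "system", "user" or "task" (hypothetical labels).
    cluster_size: total number of nodes in the cluster.
    """
    # Record the set of nodes on which each job has failed at least once.
    nodes_per_job = defaultdict(set)
    for job_id, node, _kind in failures:
        nodes_per_job[job_id].add(node)

    badness = defaultdict(float)
    for job_id, node, kind in failures:
        if kind == "user":
            # Known user error (e.g. over the memory limit): never blame the node.
            continue
        if kind == "system":
            # Node-level fault (e.g. a disk error): counts in full.
            badness[node] += 1.0
            continue
        # Ordinary task failure: the more of the cluster this job has failed
        # on, the less each failure says about any single node.
        spread = len(nodes_per_job[job_id]) / cluster_size
        badness[node] += 1.0 - spread
    return dict(badness)
```

With this weighting, a job that fails on every node adds nothing to any node's score, while a job that fails on only one node out of four adds 0.75 to that node. A real implementation would still need a way to classify failures reliably, which is the hard part raised in the comment.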