[ https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129365#comment-13129365 ]
Todd Lipcon commented on MAPREDUCE-3184:
----------------------------------------

Having talked with Arun and Aaron about this a bit, we came up with a few candidate solutions, several of which are described above. However, several of the solutions would require semi-invasive changes to the JT or TT, or require semantic changes to the behavior of the health check script. As Arun put it, we don't want to introduce a generalized solution when the issue here is a very specific Jetty bug -- the generalized solution might have other ill effects that would be hard to pin down, making the change hard to verify.

So, instead, the approach we will take is to apply a very specific fix for this very specific Jetty issue: start a thread inside the TT which monitors for a spinning Jetty selector thread and, if one is detected, shuts down the TT. This will cause any reducers to immediately start receiving "Connection refused" errors and recover from the situation rapidly. Existing monitoring scripts will easily notice the failed TT so that the admin can restart it.

Clearly, this fix is hack-ish, but it's a hack localized to the scope of the TT, and just a single new thread in the TT at that. Thus it has little chance of causing regressions with regard to other shuffle heuristics, etc. I will upload a patch momentarily.

> Improve handling of fetch failures when a tasktracker is not responding on HTTP
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3184
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.205.0
>            Reporter: Todd Lipcon
>
> On a 100 node cluster, we had an issue where one of the TaskTrackers was hit by MAPREDUCE-2386 and stopped responding to fetches. The behavior observed was the following:
> - Every reducer would try to fetch the same map task, and fail after ~13 minutes.
> - At that point, all reducers would report this failed fetch to the JT for the same task, and the task would be re-run.
> - Meanwhile, the reducers would move on to the next map task that ran on the TT, and hang for another 13 minutes.
> The job essentially made no progress for hours, as each map task that ran on the bad node was serially marked failed.
> To combat this issue, we should introduce a second type of failed fetch notification, used when the TT does not respond at all (i.e. SocketTimeoutException, etc.). These fetch failure notifications should count against the TT at large, rather than against a single task. If more than half of the reducers report such an issue for a given TT, then all of the tasks from that TT should be re-run.
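
To make the TT-side watchdog idea above concrete, here is a minimal sketch, assuming the spin can be detected by sampling per-thread CPU time via ThreadMXBean and that the Jetty selector thread is identifiable by its name. The class name, thread-name pattern, sampling interval, and CPU threshold are all illustrative and are not taken from the actual patch.

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch only -- not the patch attached to this issue.
 * A daemon thread that samples per-thread CPU time and, if a Jetty
 * selector thread has burned essentially a whole sampling interval
 * on CPU, assumes the selector-spin bug has hit and kills the process
 * so reducers see "Connection refused" instead of hanging.
 */
public class JettySpinWatchdog extends Thread {
  private static final long CHECK_INTERVAL_MS = 60000L;   // sample once a minute
  private static final double SPIN_CPU_FRACTION = 0.95;   // "spinning" threshold

  private final ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
  // Previous CPU-time sample (nanoseconds) per monitored thread id.
  private final Map<Long, Long> lastCpuNanos = new HashMap<Long, Long>();

  public JettySpinWatchdog() {
    setName("JettySpinWatchdog");
    setDaemon(true);
  }

  @Override
  public void run() {
    if (!threadBean.isThreadCpuTimeSupported()) {
      return;                       // cannot measure CPU time on this JVM
    }
    while (true) {
      try {
        Thread.sleep(CHECK_INTERVAL_MS);
      } catch (InterruptedException ie) {
        return;
      }
      for (long tid : threadBean.getAllThreadIds()) {
        ThreadInfo info = threadBean.getThreadInfo(tid);
        // Assumption: the selector thread is identifiable by name; the
        // real fix may locate it differently.
        if (info == null
            || !info.getThreadName().contains("SelectChannelConnector")) {
          continue;
        }
        long cpu = threadBean.getThreadCpuTime(tid);
        if (cpu < 0) {
          continue;                 // CPU time unavailable for this thread
        }
        Long prev = lastCpuNanos.put(tid, cpu);
        if (prev == null) {
          continue;                 // first sample; nothing to compare yet
        }
        double fraction = (cpu - prev)
            / (double) TimeUnit.MILLISECONDS.toNanos(CHECK_INTERVAL_MS);
        if (fraction > SPIN_CPU_FRACTION) {
          // Selector burned nearly the whole interval on CPU: assume the
          // spin bug and exit so external monitoring restarts the TT.
          System.err.println("Jetty selector thread '" + info.getThreadName()
              + "' appears to be spinning; shutting down the TaskTracker");
          System.exit(1);
        }
      }
    }
  }
}
{code}

In this sketch the TT would start the thread at service startup, and the existing monitoring scripts mentioned above would notice the exited daemon and let the admin restart it.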
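For the JobTracker-side approach proposed in the quoted issue description (the generalized path the comment above deliberately avoids), the bookkeeping it implies could look roughly like the sketch below; the class and method names are invented for illustration and do not correspond to anything in the Hadoop source.

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch only. Tracks connect-level fetch failures per
 * TaskTracker and answers "have more than half of the job's reducers
 * reported this tracker?", which is the proposed trigger for re-running
 * all of the map output that lives on that tracker.
 */
public class UnresponsiveTrackerTally {
  // Tracker name -> reducer attempt IDs that could not connect at all
  // (SocketTimeoutException, connection refused, etc.).
  private final Map<String, Set<String>> reportsByTracker =
      new HashMap<String, Set<String>>();

  /**
   * Records that {@code reduceAttemptId} failed to connect to
   * {@code trackerName}. Returns true once a majority of the job's
   * reducers have reported the same tracker.
   */
  public synchronized boolean reportUnresponsive(String trackerName,
                                                 String reduceAttemptId,
                                                 int totalReducers) {
    Set<String> reporters = reportsByTracker.get(trackerName);
    if (reporters == null) {
      reporters = new HashSet<String>();
      reportsByTracker.put(trackerName, reporters);
    }
    reporters.add(reduceAttemptId);
    return reporters.size() * 2 > totalReducers;
  }
}
{code}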