[ https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127884#comment-13127884 ]

Todd Lipcon commented on MAPREDUCE-3184:
----------------------------------------

I remember hearing at one point that another use case for the health script was 
to check for things like broken NFS mounts or missing shared libraries on a set 
of nodes. In those cases, it would make sense not to schedule new tasks, but it 
doesn't make sense to lose the already-completed task outputs.

Another point here is that monitoring for timeouts on HTTP GET is insufficient 
-- when a TT is merely highly loaded, monitoring GETs can time out even though 
tasks are still successfully retrieving map output. To correctly distinguish an 
overloaded TT from a stuck one, we'd need to do one of the following:
1) impose a separate limit on the number of concurrent MapOutputServlet 
invocations, a little lower than the limit set by tasktracker.http.threads. If 
MapOutputServlet is requested while too many invocations are already in 
progress, it would return an HTTP result code indicating that the TT is too 
busy -- distinct from the timeout the fetcher currently sees (a sketch of this 
follows the list).
2) have smarter monitoring on the client side which looks at a metric like CPU 
usage to detect the "100% spinning" case.
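
A minimal sketch of option (1), assuming a hypothetical MAX_CONCURRENT_FETCHES 
cap guarded by a Semaphore; the real change would have to be wired into the 
existing MapOutputServlet rather than this stand-alone class:

{code}
import java.io.IOException;
import java.util.concurrent.Semaphore;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class MapOutputServletSketch extends HttpServlet {
  // Illustrative value: should sit a little below tasktracker.http.threads
  // so some Jetty threads always remain free for other requests.
  private static final int MAX_CONCURRENT_FETCHES = 35;
  private final Semaphore fetchSlots = new Semaphore(MAX_CONCURRENT_FETCHES);

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    if (!fetchSlots.tryAcquire()) {
      // Tell the reducer we are overloaded instead of letting its GET time out,
      // so "busy" is distinguishable from "wedged" on the client side.
      response.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
          "TaskTracker has too many map-output fetches in progress");
      return;
    }
    try {
      serveMapOutput(request, response);
    } finally {
      fetchSlots.release();
    }
  }

  private void serveMapOutput(HttpServletRequest request,
      HttpServletResponse response) throws IOException {
    // ... existing map-output streaming logic would go here ...
  }
}
{code}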

Another option to consider would be to have the TT run a small thread which 
issues HTTP GETs against itself. If a GET times out, the thread can check how 
many Jetty threads are actually actively serving requests. If there are free 
Jetty threads but the TT still can't serve output, the TT can just FATAL 
itself; something like the sketch below.
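
A rough sketch of that self-probe thread, assuming hypothetical names 
(ShuffleSelfCheck, idleJettyThreads()) rather than existing TT code:

{code}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class ShuffleSelfCheck extends Thread {
  private static final Log LOG = LogFactory.getLog(ShuffleSelfCheck.class);
  private static final int PROBE_TIMEOUT_MS = 30000;
  private static final long PROBE_INTERVAL_MS = 60000;

  private final URL probeUrl;   // points at this TT's own MapOutputServlet

  ShuffleSelfCheck(URL probeUrl) {
    this.probeUrl = probeUrl;
    setName("ShuffleSelfCheck");
    setDaemon(true);
  }

  @Override
  public void run() {
    while (true) {
      try {
        Thread.sleep(PROBE_INTERVAL_MS);
      } catch (InterruptedException ie) {
        return;
      }
      try {
        HttpURLConnection conn = (HttpURLConnection) probeUrl.openConnection();
        conn.setConnectTimeout(PROBE_TIMEOUT_MS);
        conn.setReadTimeout(PROBE_TIMEOUT_MS);
        conn.getResponseCode();   // any response at all means Jetty is alive
        conn.disconnect();
      } catch (SocketTimeoutException ste) {
        // Free Jetty threads but still no map output: assume the shuffle side
        // is wedged (e.g. MAPREDUCE-2386) and abort so the JT reschedules our
        // tasks.
        if (idleJettyThreads() > 0) {
          LOG.fatal("Self-probe of " + probeUrl + " timed out despite idle "
              + "Jetty threads; shutting down this TaskTracker", ste);
          System.exit(-1);
        }
      } catch (IOException ioe) {
        // Connection refused, reset, etc.: leave to existing failure handling.
      }
    }
  }

  private int idleJettyThreads() {
    // Placeholder: a real implementation would ask the Jetty thread pool for
    // its idle-thread count.
    return 0;
  }
}
{code}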

I think this last option may be preferable to the health-check approach, since 
it would ship with Hadoop without any extra configuration on the part of the 
user, and would be less prone to false detection issues.
                
> Improve handling of fetch failures when a tasktracker is not responding on 
> HTTP
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3184
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.205.0
>            Reporter: Todd Lipcon
>
> On a 100 node cluster, we had an issue where one of the TaskTrackers was hit 
> by MAPREDUCE-2386 and stopped responding to fetches. The behavior observed 
> was the following:
> - every reducer would try to fetch the same map task, and fail after ~13 
> minutes.
> - At that point, all reducers would report this failed fetch to the JT for 
> the same task, and the task would be re-run.
> - Meanwhile, the reducers would move on to the next map task that ran on the 
> TT, and hang for another 13 minutes.
> The job essentially made no progress for hours, as each map task that ran on 
> the bad node was serially marked failed.
> To combat this issue, we should introduce a second type of failed fetch 
> notification, used when the TT does not respond at all (ie 
> SocketTimeoutException, etc). These fetch failure notifications should count 
> against the TT at large, rather than a single task. If more than half of the 
> reducers report such an issue for a given TT, then all of the tasks from that 
> TT should be re-run.
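
A minimal sketch of the per-TT bookkeeping proposed in the description above, 
using illustrative names (FetchTimeoutTracker, reportUnreachable) rather than 
existing JobTracker code:

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Tracks "TT not responding at all" notifications separately from per-task
// fetch failures, and signals when a majority of reducers have reported a TT.
class FetchTimeoutTracker {
  // trackerName -> reduce attempt IDs that hit a SocketTimeoutException (etc.)
  private final Map<String, Set<String>> reportsPerTracker =
      new HashMap<String, Set<String>>();

  /**
   * Record a "TT unreachable" notification from one reducer.
   * @return true once more than half of the job's reducers have reported this
   *         TT, meaning all map outputs produced on it should be re-executed.
   */
  synchronized boolean reportUnreachable(String trackerName,
      String reduceAttemptId, int totalReducers) {
    Set<String> reporters = reportsPerTracker.get(trackerName);
    if (reporters == null) {
      reporters = new HashSet<String>();
      reportsPerTracker.put(trackerName, reporters);
    }
    reporters.add(reduceAttemptId);
    return reporters.size() * 2 > totalReducers;
  }
}
{code}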
