[ https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131260#comment-13131260 ]
Eli Collins commented on MAPREDUCE-3184:
----------------------------------------

I thought about some other workarounds (eg bumping up org.mortbay.io.nio.BUSY_PAUSE), but after thinking about it more I think what you have here is a reasonable approach. Given where we are in MR1's life I agree a targeted approach makes more sense. The code looks good.

The only thing I'm wondering is whether we should disable the detection by default. The semantics of getThreadCpuTime aren't entirely clear (eg does it return user time, system time, or both?) and are platform (jdk/OS) specific (eg does IO time ever get counted?). Also, according to the JDK docs, "thread CPU measurement could be expensive in some Java virtual machine implementations." Eg see the following issues on some versions of Sun Java on Linux:
# http://download.oracle.com/javase/6/docs/api/java/lang/management/ThreadMXBean.html
# http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6888526
# http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6491083

Btw, I think it's possible for a subsequent call to System#nanoTime to return a smaller value, though this shouldn't cause a false positive in the detection routine.

Nit: I'd up the info on JettyBugMonitor line 80 to a warning, since it's logged once and it's perhaps disabling a feature the user thinks they have.

> Improve handling of fetch failures when a tasktracker is not responding on HTTP
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3184
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: mr-3184.txt
>
>
> On a 100 node cluster, we had an issue where one of the TaskTrackers was hit by MAPREDUCE-2386 and stopped responding to fetches. The behavior observed was the following:
> - every reducer would try to fetch the same map task, and fail after ~13 minutes.
> - At that point, all reducers would report this failed fetch to the JT for the same task, and the task would be re-run.
> - Meanwhile, the reducers would move on to the next map task that ran on the TT, and hang for another 13 minutes.
> The job essentially made no progress for hours, as each map task that ran on the bad node was serially marked failed.
> To combat this issue, we should introduce a second type of failed fetch notification, used when the TT does not respond at all (ie SocketTimeoutException, etc). These fetch failure notifications should count against the TT at large, rather than a single task. If more than half of the reducers report such an issue for a given TT, then all of the tasks from that TT should be re-run.
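To make the proposed threshold concrete, here is a hypothetical sketch of the tracker-level accounting described above. The class and method names (TrackerFetchFailureTracker, reportUnreachable) are illustrative only, not existing Hadoop APIs:

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-TaskTracker fetch-failure accounting; not
// actual JobTracker code. Tracks which reducers reported a tracker as
// unreachable and flags the tracker once a majority has done so.
class TrackerFetchFailureTracker {
  private final Map<String, Set<String>> reportsByTracker =
      new HashMap<String, Set<String>>();

  /**
   * Record that a reducer could not connect to a TaskTracker at all
   * (eg SocketTimeoutException rather than a per-map fetch failure).
   *
   * @return true if more than half of the reducers have now reported the
   *         tracker, ie all of its completed maps should be re-run.
   */
  synchronized boolean reportUnreachable(String trackerName, String reducerId,
      int totalReducers) {
    Set<String> reporters = reportsByTracker.get(trackerName);
    if (reporters == null) {
      reporters = new HashSet<String>();
      reportsByTracker.put(trackerName, reporters);
    }
    reporters.add(reducerId);
    return reporters.size() * 2 > totalReducers;
  }
}
{code}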
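On the getThreadCpuTime question in the comment above: the ThreadMXBean javadoc describes getThreadCpuTime as total CPU time (user plus system, where the platform distinguishes them) and getThreadUserTime as user-mode time only. Below is a minimal sketch of the kind of per-thread CPU sampling such spin detection relies on; it is illustrative only, not the actual JettyBugMonitor code, and the 95% threshold is an assumption:

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Illustrative sketch only, not the JettyBugMonitor implementation.
public class SpinDetectorSketch {
  // Assumed threshold: treat >95% CPU over the sample window as spinning.
  private static final double CPU_FRACTION_THRESHOLD = 0.95;

  public static boolean isSpinning(long threadId, long sampleMillis)
      throws InterruptedException {
    ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
    if (!mxBean.isThreadCpuTimeSupported() || !mxBean.isThreadCpuTimeEnabled()) {
      // Measurement unavailable on this JVM/OS; detection should be skipped.
      return false;
    }

    long startCpu = mxBean.getThreadCpuTime(threadId); // user + system, ns
    long startWall = System.nanoTime();
    Thread.sleep(sampleMillis);
    long endCpu = mxBean.getThreadCpuTime(threadId);
    long elapsedWall = System.nanoTime() - startWall;

    // getThreadCpuTime returns -1 if the thread is not alive; also guard
    // against a non-positive wall-clock delta in case nanoTime misbehaves
    // on the platform, so a bad sample never yields a false positive.
    if (startCpu < 0 || endCpu < 0 || elapsedWall <= 0) {
      return false;
    }
    return (double) (endCpu - startCpu) / elapsedWall > CPU_FRACTION_THRESHOLD;
  }
}
{code}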