[jira] [Commented] (MAPREDUCE-3184) Improve handling of fetch failures when a tasktracker is not responding on HTTP

Todd Lipcon (Commented) (JIRA) Tue, 25 Oct 2011 10:34:53 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135261#comment-13135261 ]


Todd Lipcon commented on MAPREDUCE-3184:
----------------------------------------

bq. eg does it always return user or system time
It returns the sum of the two. The docs aren't well written ("If the 
implementation distinguishes between user mode time and system mode time, the 
returned CPU time is the amount of time that the thread has executed in user 
mode or system mode"), but looking at the source, the intention is clear.

Worst case, if an implementation returns only user or only system time, then 
the CPU usage will be under-estimated, which is OK. As long as we don't 
over-estimate, it won't cause a false shutdown.
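
For reference, a minimal sketch of reading both figures through the standard 
java.lang.management.ThreadMXBean API. The class and method names 
(ThreadCpuSample, currentThreadCpuNanos, currentThreadUserNanos) are made up 
for illustration and are not taken from the patch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, for illustration only.
final class ThreadCpuSample {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();

    // True only if the JVM supports and has enabled CPU timing for the current thread.
    private static boolean available() {
        return BEAN.isCurrentThreadCpuTimeSupported() && BEAN.isThreadCpuTimeEnabled();
    }

    // Total CPU time (user + system) of the calling thread in nanoseconds, or -1 if unavailable.
    static long currentThreadCpuNanos() {
        return available() ? BEAN.getCurrentThreadCpuTime() : -1L;
    }

    // User-mode CPU time of the calling thread in nanoseconds, or -1 if unavailable.
    static long currentThreadUserNanos() {
        return available() ? BEAN.getCurrentThreadUserTime() : -1L;
    }
}
```

If a JVM reported only one of the two components from getCurrentThreadCpuTime, 
the value would be too low rather than too high, which is the safe direction 
here.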

bq. thread CPU measurement could be expensive in some Java virtual machine 
implementations

Looking at those Sun bugs, I think "expensive" here is a relative term. For 
example, the bug says "the submitted test case takes almost 4 seconds to do 
100k CPU measurements", i.e. roughly 40us per call. Given that we make the 
call once every 15 seconds, I'm not too concerned. I think the "may be 
expensive" warning is just there to keep people from sprinkling these calls 
throughout their performance-sensitive code to do metrics, etc.
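
To make the arithmetic explicit, here is a small sketch that reproduces the 
per-call cost from the bug report's figures and relates it to a 15 second 
sampling interval. The class and method names are hypothetical; the inputs 
(4 seconds, 100k calls, 15 seconds) are the figures quoted above.

```java
// Hypothetical helper, for illustration only.
final class SamplingCost {
    // Average cost of one measurement in microseconds.
    static double perCallMicros(double totalSeconds, long calls) {
        return totalSeconds * 1_000_000.0 / calls;
    }

    // Fraction of one sampling interval spent inside a single measurement call.
    static double overheadFraction(double perCallMicros, double intervalSeconds) {
        return perCallMicros / (intervalSeconds * 1_000_000.0);
    }

    public static void main(String[] args) {
        double micros = perCallMicros(4.0, 100_000);      // 40.0
        double fraction = overheadFraction(micros, 15.0); // about 2.7e-6
        System.out.println(micros + " us per call, overhead fraction " + fraction);
    }
}
```

At one call per 15 seconds, that works out to under 0.0003% of one core.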

bq. Btw I think it's possible for a subsequent call to System#nanoTime to 
return a smaller value
I don't think so - nanoTime is "time since a fixed but arbitrary point in the 
past". On Linux it's implemented with clock_gettime(CLOCK_MONOTONIC).
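
A minimal sketch of the usual way to measure an interval with 
System.nanoTime: keep the start value and take the difference, rather than 
comparing two raw readings. The class and method names are hypothetical.

```java
// Hypothetical helper, for illustration only.
final class Elapsed {
    // Nanoseconds elapsed since a value previously obtained from System.nanoTime().
    // Subtraction stays correct even if the raw counter wraps around.
    static long nanosSince(long startNanos) {
        return System.nanoTime() - startNanos;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i; // some work to time
        }
        System.out.println("sum=" + sum + " took " + nanosSince(start) + " ns");
    }
}
```

The raw values are only meaningful as differences within one JVM; they are 
not wall-clock timestamps.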
                
> Improve handling of fetch failures when a tasktracker is not responding on 
> HTTP
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3184
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: mr-3184.txt
>
>
> On a 100 node cluster, we had an issue where one of the TaskTrackers was hit 
> by MAPREDUCE-2386 and stopped responding to fetches. The behavior observed 
> was the following:
> - Every reducer would try to fetch the same map task, and fail after ~13 
> minutes.
> - At that point, all reducers would report this failed fetch to the JT for 
> the same task, and the task would be re-run.
> - Meanwhile, the reducers would move on to the next map task that ran on the 
> TT, and hang for another 13 minutes.
> The job essentially made no progress for hours, as each map task that ran on 
> the bad node was serially marked failed.
> To combat this issue, we should introduce a second type of failed fetch 
> notification, used when the TT does not respond at all (e.g. 
> SocketTimeoutException). These fetch failure notifications should count 
> against the TT at large, rather than a single task. If more than half of the 
> reducers report such an issue for a given TT, then all of the tasks from that 
> TT should be re-run.
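
The proposed rule can be sketched as a small bookkeeping class on the 
JobTracker side. This is only an illustration of the "more than half of the 
reducers" threshold described above, not code from the attached patch; the 
class name, method name, and the use of plain strings for tracker and reducer 
identifiers are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not taken from the attached patch.
final class TrackerFetchFailures {
    private final int totalReducers;
    // For each tracker, the distinct reducers that reported it as unresponsive.
    private final Map<String, Set<String>> reportersByTracker = new HashMap<>();

    TrackerFetchFailures(int totalReducers) {
        this.totalReducers = totalReducers;
    }

    // Records one "tracker did not respond" report. Returns true once more than
    // half of the job's reducers have reported the tracker, i.e. when all map
    // output from that tracker should be re-run.
    boolean reportUnresponsive(String tracker, String reducerId) {
        Set<String> reporters =
            reportersByTracker.computeIfAbsent(tracker, k -> new HashSet<>());
        reporters.add(reducerId); // repeated reports from one reducer count once
        return reporters.size() * 2 > totalReducers;
    }
}
```

Counting distinct reducers rather than raw reports keeps a single reducer that 
retries many times from tripping the threshold on its own.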

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
