[ https://issues.apache.org/jira/browse/MAPREDUCE-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096088#comment-14096088 ]
Jason Lowe commented on MAPREDUCE-4818:
---------------------------------------

Not sure changing the long-standing meaning of progress for maps and reduces would be OK. Adding a new localization phase could cause some problems with speculative execution, since the default speculator assumes progress is linear, which usually works well for map tasks. (Reducers not so much, as their progress is already non-linear.) So, as you said, if we did add it, I think we'd be stuck with keeping the size of that phase at 0 for quite some time.

However, I'm a big fan of adding some kind of localization-related status update message, and I think that would get us most of the benefit without having to add an explicit new LOCALIZING state or phase for tasks. We'd have to add such a state if we want the AM to have a separate timeout or separate progress phase for localization, but I think just having an appropriate localization status message associated with the task when it times out would be helpful and hopefully much less involved.

Unfortunately the task isn't running when localizing, so we can't get this via the task umbilical like other status updates. The AM would have to do an explicit container status query to the NM to obtain the message, and the NM would have to update container status as localization progressed. Or were you thinking of a different approach?

> Easier identification of tasks that timeout during localization
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4818
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4818
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.3-alpha
>            Reporter: Jason Lowe
>              Labels: usability
>
> When a task is taking too long to localize and is killed by the AM due to
> task timeout, the job UI/history is not very helpful. The attempt simply
> lists a diagnostic stating it was killed due to timeout, but there are no
> logs for the attempt since it never actually got started. There are log
> messages on the NM that show the container never made it past localization by
> the time it was killed, but users often do not have access to those logs.
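
For illustration only, the status-message approach described in the comment (the NM records localization progress on the container, and the AM queries it when an attempt times out) might look roughly like the sketch below. This is a toy model, not Hadoop code: `NodeManagerStub`, `update_localization`, `get_container_status`, `timeout_diagnostic`, and the message formats are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NodeManagerStub:
    """Hypothetical stand-in for an NM that keeps a status message per container."""
    _status: Dict[str, str] = field(default_factory=dict)

    def update_localization(self, container_id: str, done: int, total: int) -> None:
        # The NM would refresh this message as resources finish downloading.
        self._status[container_id] = f"LOCALIZING: {done}/{total} resources fetched"

    def mark_running(self, container_id: str) -> None:
        # Once the task launches, localization is over.
        self._status[container_id] = "RUNNING"

    def get_container_status(self, container_id: str) -> Optional[str]:
        # Models the explicit status query the AM would issue to the NM.
        return self._status.get(container_id)


def timeout_diagnostic(nm: NodeManagerStub, container_id: str, timeout_secs: int) -> str:
    """Build the diagnostic the AM attaches to an attempt killed for timing out."""
    base = f"Attempt timed out after {timeout_secs} secs"
    status = nm.get_container_status(container_id)
    if status is not None and status.startswith("LOCALIZING"):
        # The task never launched, so the umbilical has nothing to report;
        # the NM's message is the only hint about where the time went.
        return f"{base}; container was still localizing ({status})"
    return base


nm = NodeManagerStub()
nm.update_localization("container_01", done=3, total=10)
print(timeout_diagnostic(nm, "container_01", 600))
```

The point of the sketch is that no new task state or progress phase is needed: the timeout path stays the same and only the diagnostic text gains the NM-supplied detail, which is what would surface in the job UI/history.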