[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096088#comment-14096088
 ] 

Jason Lowe commented on MAPREDUCE-4818:
---------------------------------------

Not sure changing the long-standing meaning of progress for maps and reduces 
would be OK.  Adding a new localization phase could cause some problems with 
speculative execution, since the default speculator assumes progress is linear, 
which usually works well for map tasks.  (Reducers not so much, as their 
progress is already non-linear.)  So, as you said, if we did add it I think 
we'd be stuck with keeping the size of that phase at 0 for quite some time.
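
As a rough illustration of the linearity concern, here is a minimal sketch of 
a linear runtime estimate of the kind a speculator might rely on.  The 
function names, the 10% phase weight, and the timings are all hypothetical and 
are not taken from the MapReduce code.

```python
def estimate_total_runtime(elapsed_secs, progress):
    """Linear extrapolation: assume progress accrues at a constant rate."""
    if progress <= 0.0:
        return float("inf")  # no progress reported yet, so no finite estimate
    return elapsed_secs / progress


def attempt_progress(localized, map_fraction, localization_weight):
    """Overall progress if localization owns a share of the progress range.

    localization_weight is hypothetical; a weight of 0 means the phase
    exists but contributes nothing to reported progress.
    """
    localization_part = localization_weight if localized else 0.0
    return localization_part + (1.0 - localization_weight) * map_fraction


if __name__ == "__main__":
    # Made-up attempt: 60 s of localization, then 100 s of map work.
    for weight in (0.0, 0.1):
        progress = attempt_progress(True, 0.0, weight)
        estimate = estimate_total_runtime(60, progress)
        print(f"weight={weight}: estimate right after localization = {estimate}")
```

In this made-up case a 10% weight makes the attempt report 10% progress after 
60 s, so the linear estimate is roughly 600 s against an actual 160 s.  A 
weight of 0 leaves reported progress unchanged, so the estimate behaves as it 
does today.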

However, I'm a big fan of adding some kind of localization-related status 
update message, and I think that would get us most of the benefit without 
having to add an explicit new LOCALIZING state or phase for tasks.  We'd have 
to add such a state if we want the AM to have a separate timeout or separate 
progress phase for localization, but I think just having an appropriate 
localization status message associated with the task when it times out would 
be helpful and hopefully much less involved.  Unfortunately, the task isn't 
running while its container is localizing, so we can't get this via the task 
umbilical like other status updates.  The AM would have to do an explicit 
container status query to the NM to obtain the message, and the NM would have 
to update the container status as localization progressed.  Or were you 
thinking of a different approach?

> Easier identification of tasks that timeout during localization
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4818
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4818
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.3-alpha
>            Reporter: Jason Lowe
>              Labels: usability
>
> When a task is taking too long to localize and is killed by the AM due to 
> task timeout, the job UI/history is not very helpful.  The attempt simply 
> lists a diagnostic stating it was killed due to timeout, but there are no 
> logs for the attempt since it never actually got started.  There are log 
> messages on the NM that show the container never made it past localization by 
> the time it was killed, but users often do not have access to those logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
