[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15452936#comment-15452936
 ] 

Haibo Chen commented on MAPREDUCE-6771:
---------------------------------------

bq. Note that we aren't stuck with TaskAttemptUnsuccessfulCompletion event for 
doing diagnostics. 
Agreed. My guess is that diagnostics are included in 
TaskAttemptUnsuccessfulCompletionEvent because users only want to see 
diagnostics when task attempts fail. Parsing a new event, and ignoring such 
events for successful task attempts, would require additional changes.
bq.  but waiting for a container completion event is not something the state 
machine does today.
There is no need to wait for a container completion event. My proposal is to 
wait for the transition into the FAILED state. As long as the task attempt goes 
into the FAILED state, which does not necessarily need to be triggered by a 
container completion event (a timeout, TA_TIMED_OUT, is already built into the 
transitions from FAIL_FINISHING_CONTAINER to FAILED), the diagnostics will be 
written into the jhist file. But your point that a wide window is susceptible 
to an AM crash is still very convincing.
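
To make the proposal concrete, here is a rough sketch of the idea. It is a toy 
model and not the actual TaskAttemptImpl state machine: the class name, the 
event names other than TA_TIMED_OUT, the in-memory historyLog standing in for 
the jhist file, and the sample diagnostics string are all assumptions made for 
illustration. The point is only that the unsuccessful-completion record is 
written on entry to FAILED, whichever event causes that transition.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a task attempt; NOT Hadoop's TaskAttemptImpl. */
class AttemptSketch {
    enum State { RUNNING, FAIL_FINISHING_CONTAINER, FAILED }

    // TA_TIMED_OUT is named in the discussion; the other event names are hypothetical.
    enum Event { TA_FAIL_REPORTED, TA_DIAGNOSTICS, TA_CONTAINER_DONE, TA_TIMED_OUT }

    private State state = State.RUNNING;
    private final List<String> diagnostics = new ArrayList<>();

    /** Stand-in for the jhist file: one entry per unsuccessful-completion record. */
    final List<String> historyLog = new ArrayList<>();

    State getState() {
        return state;
    }

    void handle(Event event, String message) {
        if (state == State.FAILED) {
            return; // terminal state: later events are ignored in this toy model
        }
        switch (event) {
            case TA_DIAGNOSTICS:
                // Accumulate diagnostics while the attempt is still alive.
                diagnostics.add(message);
                break;
            case TA_FAIL_REPORTED:
                if (state == State.RUNNING) {
                    state = State.FAIL_FINISHING_CONTAINER;
                }
                break;
            case TA_CONTAINER_DONE:
            case TA_TIMED_OUT:
                // Either trigger moves the attempt to FAILED; the record is
                // written here, so it carries everything collected so far.
                state = State.FAILED;
                historyLog.add("unsuccessful: " + String.join("; ", diagnostics));
                break;
        }
    }

    public static void main(String[] args) {
        AttemptSketch attempt = new AttemptSketch();
        attempt.handle(Event.TA_DIAGNOSTICS, "container killed: over memory limit");
        attempt.handle(Event.TA_FAIL_REPORTED, null);
        attempt.handle(Event.TA_TIMED_OUT, null); // no container completion event needed
        System.out.println(attempt.getState() + " " + attempt.historyLog);
    }
}
```

The sketch also shows the drawback raised above: nothing is written between 
the failure report and the FAILED transition, so an AM crash inside that window 
would lose the record.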

Given that there is no clearly preferable approach to address the case in 
MAPREDUCE-4955, do you think I can go ahead and address the issue in this jira? 
The symptom of MAPREDUCE-4955 and this one is the same, but the cause is not 
quite the same. The case in MAPREDUCE-4955 happens when the AM thinks the task 
attempt is already dead, or the diagnostics come after a 
taskUnsuccessfulCompletionEvent has already been generated, whereas the case in 
this jira happens when the diagnostics come in while the task attempt is still 
in the running state, or before a taskUnsuccessfulCompletionEvent. The case in 
this jira is easy to fix, and we can keep MAPREDUCE-4955 to address the other 
once we decide what to do.
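
The difference between the two orderings can be sketched as below. This is a 
simplified illustration, not Hadoop code: the OrderingSketch class and its 
methods are hypothetical, and it assumes the record snapshots whatever 
diagnostics are known at the moment the unsuccessful-completion event is 
generated. Under that assumption, diagnostics that arrive first can be kept 
(the case in this jira), while diagnostics that arrive afterwards are still 
missed (the case left to MAPREDUCE-4955).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the ordering between diagnostics and the history record. */
class OrderingSketch {
    private final List<String> diagnostics = new ArrayList<>();
    private String recorded = null; // what the history record captured, if any

    void addDiagnostics(String message) {
        diagnostics.add(message);
    }

    /** Snapshot the diagnostics known right now; only the first call counts. */
    void emitUnsuccessfulCompletion() {
        if (recorded == null) {
            recorded = String.join("; ", diagnostics);
        }
    }

    String recorded() {
        return recorded;
    }

    public static void main(String[] args) {
        // Diagnostics arrive before the event is generated (this jira).
        OrderingSketch early = new OrderingSketch();
        early.addDiagnostics("container killed: over memory limit");
        early.emitUnsuccessfulCompletion();

        // Diagnostics arrive after the event is generated (MAPREDUCE-4955).
        OrderingSketch late = new OrderingSketch();
        late.emitUnsuccessfulCompletion();
        late.addDiagnostics("container killed: over memory limit");

        System.out.println("early: [" + early.recorded() + "]");
        System.out.println("late:  [" + late.recorded() + "]");
    }
}
```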

> Diagnostics information can be lost in .jhist if task containers are killed 
> by Node Manager.
> --------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6771
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6771
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.7.3
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: TaUnsuccessfullyEventEmission.jpg, 
> mapreduce6771.001.patch
>
>
> Task containers can go over their resource limit and be killed by the Node 
> Manager. The MR AM then gets notified of the container status and diagnostics 
> information through its heartbeat with the RM.  However, it is possible that 
> the diagnostics information never gets into the .jhist file, so when the job 
> completes, the diagnostics information associated with the failed task 
> attempts is empty.  This makes it hard for users to root-cause job failures, 
> which are often caused by memory leaks.



