[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15447008#comment-15447008
 ] 

Haibo Chen commented on MAPREDUCE-6771:
---------------------------------------

Thanks [~jlowe] for bringing up the case in MAPREDUCE-4955. Looking at that 
jira, it is indeed another case where diagnostics could be lost. 

bq. The AM would either need to postpone recording the attempt completion event 
until it receives the container completion event to see if there are any 
diagnostics or there needs to be a way to record postmortem diagnostics for 
attempts in the jhist file.
The diagnostics are included as part of a 
TaskAttemptUnsuccessfulCompletionEvent, so my understanding is that there 
should ideally be one such event per task attempt in the jhist file (if we emit 
multiple instances, JobHistoryParser will always take the last instance seen in 
the .jhist file). Please correct me if I am wrong. Therefore, I am thinking of 
postponing the recording of the unsuccessful completion event.
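
To illustrate the "last instance wins" behaviour mentioned above, here is a 
minimal sketch. It does not use the real JobHistoryParser; the AttemptEvent 
class, the replay method, the attempt id and the diagnostics string are all 
hypothetical stand-ins, and it only assumes that events are applied in file 
order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LastEventWins {
    /** Hypothetical stand-in for one unsuccessful-completion record in a .jhist file. */
    static final class AttemptEvent {
        final String attemptId;
        final String diagnostics;

        AttemptEvent(String attemptId, String diagnostics) {
            this.attemptId = attemptId;
            this.diagnostics = diagnostics;
        }
    }

    /** Replays events in file order; a later event for the same attempt overwrites the earlier one. */
    static Map<String, String> replay(List<AttemptEvent> events) {
        Map<String, String> diagnosticsByAttempt = new LinkedHashMap<>();
        for (AttemptEvent event : events) {
            diagnosticsByAttempt.put(event.attemptId, event.diagnostics);
        }
        return diagnosticsByAttempt;
    }

    public static void main(String[] args) {
        List<AttemptEvent> events = new ArrayList<>();
        // First event written before any container diagnostics arrived (illustrative data).
        events.add(new AttemptEvent("attempt_1", ""));
        // Second event for the same attempt carrying the diagnostics (illustrative data).
        events.add(new AttemptEvent("attempt_1", "Container killed for exceeding memory limits"));
        System.out.println(replay(events).get("attempt_1"));
    }
}
```

Under this assumption, emitting a second event would also recover the 
diagnostics, but a single event per attempt keeps the file unambiguous.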

bq. postpone recording the attempt completion event until it receives the 
container completion event to see if there are any diagnostics
TaskAttemptUnsuccessfulCompletionEvent is generated upon receipt of TA_KILL, 
TA_TOO_MANY_FETCH_FAILURE and TA_FAILMSG. Postponing the event emission until a 
container completion event is received would make the handling of TA_FAILMSG 
semantically inconsistent with that of the other cases. I wonder if it is 
semantically cleaner to postpone the completion event until the transition into 
the final states (FAILED, KILLED). The emission of 
TaskAttemptUnsuccessfulCompletionEvents currently happens before the transition 
into the FAIL_FINISHING_CONTAINER, FAILED or KILLED state, but given that 
FAIL_FINISHING_CONTAINER will eventually transition into FAILED, we could 
reduce the three cases to two (see the attachment, which shows the transitions 
during which a TaskAttemptUnsuccessfulCompletionEvent is generated). That is, 
right before a task attempt goes into KILLED or FAILED, a 
TaskAttemptUnsuccessfulCompletionEvent is written into the .jhist file.
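
A rough sketch of the idea, to make the proposal concrete. This is not the 
actual TaskAttemptImpl state machine and not what the attached patch does; the 
class, its methods, the reduced set of states and the diagnostics strings are 
hypothetical, and the history list merely stands in for the .jhist writer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DeferredCompletionSketch {
    enum State { RUNNING, FAIL_FINISHING_CONTAINER, FAILED, KILLED }

    private State state = State.RUNNING;
    private final List<String> diagnostics = new ArrayList<>();
    // Stand-in for the .jhist writer: one string per unsuccessful-completion event.
    private final List<String> history = new ArrayList<>();

    boolean isFinal() {
        return state == State.FAILED || state == State.KILLED;
    }

    /** Accumulates diagnostics; anything arriving after a final state is ignored in this sketch. */
    void addDiagnostics(String message) {
        if (!isFinal()) {
            diagnostics.add(message);
        }
    }

    /** Moves to the next state and emits the completion event only on entering FAILED or KILLED. */
    void transitionTo(State next) {
        if (isFinal()) {
            return; // final states are terminal
        }
        state = next;
        if (isFinal()) {
            history.add(state + ": " + String.join("; ", diagnostics));
        }
    }

    List<String> history() {
        return Collections.unmodifiableList(history);
    }

    public static void main(String[] args) {
        DeferredCompletionSketch attempt = new DeferredCompletionSketch();
        attempt.addDiagnostics("task reported failure");
        attempt.transitionTo(State.FAIL_FINISHING_CONTAINER); // nothing emitted yet
        attempt.addDiagnostics("container killed by node manager");
        attempt.transitionTo(State.FAILED); // single event carrying both diagnostics
        System.out.println(attempt.history());
    }
}
```

Note that in this sketch, diagnostics that arrive after the attempt has 
already reached FAILED or KILLED are still dropped, so it only covers 
diagnostics received before the final transition.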



> Diagnostics information can be lost in .jhist if task containers are killed 
> by Node Manager.
> --------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6771
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6771
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.7.3
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: mapreduce6771.001.patch
>
>
> Task containers can go over their resource limit and be killed by the Node 
> Manager. The MR AM is then notified of the container status and diagnostics 
> information through its heartbeat with the RM. However, it is possible that 
> the diagnostics information never gets into the .jhist file, so when the job 
> completes, the diagnostics information associated with the failed task 
> attempts is empty. This makes it hard for users to find the root cause of job 
> failures that are often caused by memory leaks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
