[jira] [Commented] (MAPREDUCE-6771) RMContainerAllocator sends container diagnostics event after corresponding completion event

Haibo Chen (JIRA) Thu, 22 Sep 2016 22:28:43 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15515460#comment-15515460
 ]


Haibo Chen commented on MAPREDUCE-6771:
---------------------------------------

Thank you for your reviews, Jason!

I did think about having a black-box test for this, but it seems to take too
much effort to generate finished containers from the RM side. (The scheduler
cannot inject finished containers into the AllocateResponse directly. We need
to generate events in the RM to trigger the RMAppAttempt to update its finished
containers, which will then be returned to RMContainerAllocator in the
heartbeat.) But please let me know if there is an easy way to do so that I am
not aware of.

I will address the rest of your comments in the new patch.


> RMContainerAllocator sends container diagnostics event after corresponding 
> completion event
> -------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6771
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6771
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.7.3
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>         Attachments: TaUnsuccessfullyEventEmission.jpg, 
> mapreduce6771.001.patch, mapreduce6771.002.patch, mapreduce6771.003.patch
>
>
> Task containers can go over their resource limit and be killed by the Node
> Manager. The MR AM then gets notified of the container status and diagnostics
> information through its heartbeat with the RM. However, it is possible that
> the diagnostics information never gets into the .jhist file, so when the job
> completes, the diagnostics information associated with the failed task
> attempts is empty. This makes it hard for users to root-cause job failures,
> which are often caused by memory leaks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

