[ 
https://issues.apache.org/jira/browse/SLIDER-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139899#comment-14139899
 ] 

Gour Saha commented on SLIDER-438:
----------------------------------

You are right. The NM is dead, so option 1 won't work. The plan is to take an 
approach similar to option 2.

> Slider agent continues to run in the container on a node where NM dies
> ----------------------------------------------------------------------
>
>                 Key: SLIDER-438
>                 URL: https://issues.apache.org/jira/browse/SLIDER-438
>             Project: Slider
>          Issue Type: Bug
>          Components: agent, agent-provider
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>
> Steps to reproduce:
> - Set up a 3-node cluster (in non-HA mode)
> - Run slider create for the HBase app-package (with HMaster and HRegionServer 
> components only, just to keep things simple)
> - Let's assume that the HRegionServer came up on a node different from that 
> of the HMaster and the Slider AM (if not, doing destroy-create a couple of 
> times will get you to this setup)
> - Kill the NM on the node where HRegionServer is running
> - Restart the NM within 10 minutes (which is the default time after which RM 
> marks the node as LOST, configurable using 
> yarn.nm.liveness-monitor.expiry-interval-ms)
> - At this point the Slider AM has received the container lost event from the 
> RM, marked the container as lost, and requested a new one from the RM. A new 
> HRegionServer container comes up (on the same host where the old one was 
> running). From then on both HRegionServer containers continue to run 
> alongside each other, and both heartbeat successfully to the AM.
> Expected:
> - Given that the first HRegionServer instance is still heartbeating to the 
> AM, the AM should be able to send a kill signal and bring the 
> agent/container down.
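
The expected behavior above can be illustrated with a minimal sketch. This is 
not Slider's actual code: the class HeartbeatTracker, its method names, and 
the "CONTINUE"/"TERMINATE" replies are all hypothetical, chosen only to show 
the idea that an AM which keeps track of lost containers can answer a 
heartbeat from one of them with a stop instruction.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Hypothetical AM-side bookkeeping (assumption, not Slider's API)."""

    live: set = field(default_factory=set)
    lost: set = field(default_factory=set)

    def register(self, container_id: str) -> None:
        # A container the AM has been allocated and has launched.
        self.live.add(container_id)

    def on_container_lost(self, container_id: str) -> None:
        # The RM reported the container as lost: stop treating it as live.
        self.live.discard(container_id)
        self.lost.add(container_id)

    def on_heartbeat(self, container_id: str) -> str:
        # Agents in live containers carry on as usual.
        if container_id in self.live:
            return "CONTINUE"
        # A lost (or unknown) container is still heartbeating:
        # reply with a stop instruction so the agent shuts itself down.
        return "TERMINATE"
```

In the scenario from the description, the old HRegionServer container would be 
moved to the lost set when the RM event arrives, so its next heartbeat would 
get the stop reply while the replacement container keeps running.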



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
