Gour Saha created SLIDER-438:
--------------------------------
Summary: Slider agent continues to run in the container on a node
where NM dies
Key: SLIDER-438
URL: https://issues.apache.org/jira/browse/SLIDER-438
Project: Slider
Issue Type: Bug
Components: agent, agent-provider
Reporter: Gour Saha
Steps to reproduce:
- Set up a 3-node cluster (in non-HA mode)
- Run slider create for HBase app-package (with HMaster and HRegionServer
components only - just to keep things simple)
- Let's assume that the HRegionServer came up on a node different from that of
HMaster and Slider AM (if not, doing destroy-create a couple of times will
definitely get you to this setup)
- Kill the NM in the node where HRegionServer is running
- Restart the NM within 10 minutes (which is the default time after which RM
marks the node as LOST, configurable using
yarn.nm.liveness-monitor.expiry-interval-ms)
- At this point Slider AM received the container lost event from RM, marked
the container as lost, and requested a new one from RM. A new HRegionServer
container came up (on the same host where the old one was running). From then
on, both HRegionServer containers continued to run happily alongside each other,
both successfully heart-beating to AM.
Expected:
- Given that the first HRegionServer instance was still heart-beating to AM,
AM should be able to send it a kill signal and bring the agent/container down.
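
The expected behavior can be sketched as follows. This is only a minimal
illustration, not Slider's actual AM or agent code: the HeartbeatTracker class,
its method names, the "OK"/"TERMINATE" replies, and the container IDs are all
hypothetical. The idea is that the AM remembers which containers it still
considers live and answers a heartbeat from any other container with a
terminate command.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Hypothetical AM-side bookkeeping; not Slider's real classes."""

    # Container IDs the AM currently considers active.
    live: set = field(default_factory=set)

    def container_started(self, container_id: str) -> None:
        # Called when a newly allocated container's agent registers.
        self.live.add(container_id)

    def container_lost(self, container_id: str) -> None:
        # Called when RM reports the container as lost (e.g. its NM died).
        self.live.discard(container_id)

    def on_heartbeat(self, container_id: str) -> str:
        # A heartbeat from a container the AM no longer tracks as live
        # means the agent is orphaned, so it is told to shut down.
        if container_id in self.live:
            return "OK"
        return "TERMINATE"


# Replay of the scenario in this report (container IDs are made up).
tracker = HeartbeatTracker()
tracker.container_started("container_01_000002")  # first HRegionServer
tracker.container_lost("container_01_000002")     # NM killed, RM reports loss
tracker.container_started("container_01_000003")  # replacement HRegionServer
old_reply = tracker.on_heartbeat("container_01_000002")
new_reply = tracker.on_heartbeat("container_01_000003")
print(old_reply, new_reply)
```

In this sketch the orphaned agent would still have to act on the reply and
stop its HRegionServer process; that part is assumed and not shown.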
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)