Gour Saha created SLIDER-439:
--------------------------------
Summary: RM never fulfilled Slider AM's container request after NM
died on a node where HRegionServer was running
Key: SLIDER-439
URL: https://issues.apache.org/jira/browse/SLIDER-439
Project: Slider
Issue Type: Bug
Components: appmaster
Reporter: Gour Saha
Steps to reproduce:
- Set up a 3-node cluster (in non-HA mode)
- Run slider create for HBase app-package (with HMaster and HRegionServer
components only - just to keep things simple)
- Let's assume that the HRegionServer came up on a node different from that of
HMaster and Slider AM (if not, doing destroy-create a couple of times will
definitely get you to this setup)
- Kill the NM on the node where HRegionServer is running
- Wait for at least 10 minutes (do not restart NM on this node)
- At this point Slider AM received the onNodesUpdated and onContainersCompleted
events from RM, unregistered the container, and requested a new one from RM
- This time the request for a new container never got fulfilled even after
waiting for several minutes
Expected:
- Given that there was absolutely nothing else running on that cluster, the
container request should have been fulfilled by RM
Interesting observation:
- After waiting long enough, I restarted the NM on the node where it was killed,
and surprisingly the new container request got fulfilled at that point and the
container with HRegionServer came up on that same node. It seemed like RM was
waiting for the NM to come back up on this node (affinity?) although it had
marked it dead a long time earlier.
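One possible reading of this observation, which nothing in this report confirms, is that the replacement request was pinned to the lost host with no fallback to other nodes. YARN's AMRMClient.ContainerRequest does accept a relaxLocality flag, but whether Slider set it to false here is an assumption. The toy model below is plain Python, not YARN or Slider code; ContainerAsk and place are hypothetical names used only to illustrate how a strictly placed request would stay pending until the host returns.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ContainerAsk:
    """Hypothetical stand-in for an AM container request (not a YARN class)."""
    preferred_host: str
    relax_locality: bool


def place(ask: ContainerAsk, live_hosts: Set[str]) -> Optional[str]:
    """Return the host that would satisfy the ask, or None if it stays pending."""
    if ask.preferred_host in live_hosts:
        return ask.preferred_host
    if ask.relax_locality and live_hosts:
        # Fall back to any live host; sorted only to keep the toy deterministic.
        return sorted(live_hosts)[0]
    # Strict placement on a host with no live NM: nothing to allocate.
    return None


# The NM on c6403 has been killed, so only two hosts are live.
live = {"c6401.ambari.apache.org", "c6402.ambari.apache.org"}
strict = ContainerAsk("c6403.ambari.apache.org", relax_locality=False)
relaxed = ContainerAsk("c6403.ambari.apache.org", relax_locality=True)

print(place(strict, live))    # request stays pending
print(place(relaxed, live))   # request lands on another live host
print(place(strict, live | {"c6403.ambari.apache.org"}))  # NM restarted
```

In this model the strict request is satisfied only once c6403 is live again, which matches what was seen after the NM restart. It does not establish that this is what RM actually did.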
Here is the Slider AM log snippet from the time it receives the onNodesUpdated
event:
{noformat}
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Nodes updated
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: onContainersCompleted([1]
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for
containerID=container_1410935367006_0001_01_000002, state=COMPLETE,
exitStatus=-100, diagnostics=Container released on a *lost* node
14/09/17 07:02:47 INFO state.AppState: Failed container in role[2] :
HBASE_REGIONSERVER
14/09/17 07:02:47 INFO state.AppState: Current count of failed role[2]
HBASE_REGIONSERVER = 1
14/09/17 07:02:47 INFO state.AppState: Removing node ID
container_1410935367006_0001_01_000002
14/09/17 07:02:47 ERROR appmaster.SliderAppMaster: Role instance
RoleInstance{role='HBASE_REGIONSERVER',
id='container_1410935367006_0001_01_000002',
container=ContainerID=container_1410935367006_0001_01_000002
nodeID=c6403.ambari.apache.org:45454 http=c6403.ambari.apache.org:8042
priority=2, createTime=1410936271481, startTime=1410936271543, released=false,
roleId=2, host=c6403.ambari.apache.org,
hostURL=http://c6403.ambari.apache.org:8042, state=5, exitCode=-100,
command='python ./infra/agent/slider-agent/agent/main.py --label
container_1410935367006_0001_01_000002___HBASE_REGIONSERVER --zk-quorum
c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181,c6403.ambari.apache.org:2181
--zk-reg-path /registry/org-apache-slider/cl1 > <LOG_DIR>/agent.out 2>&1 ; ',
diagnostics='Container released on a *lost* node', output=null,
environment=[AGENT_WORK_ROOT="$PWD", HADOOP_USER_NAME="yarn",
AGENT_LOG_ROOT="$LOG_DIRS", PYTHONPATH="./infra/agent/slider-agent/",
SLIDER_PASSPHRASE="DEV"]} failed
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Unregistering component
container_1410935367006_0001_01_000002
14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_REGIONSERVER',
key=2, desired=1, actual=0, requested=0, releasing=0, failed=1, started=1,
startFailed=0, completed=0, failureMessage='Failure
container_1410935367006_0001_01_000002 on host c6403.ambari.apache.org:
http://c6402.ambari.apache.org:19888/jobhistory/logs/c6403.ambari.apache.org:45454/container_1410935367006_0001_01_000002/ctx/yarn'}
14/09/17 07:02:47 INFO state.AppState: HBASE_REGIONSERVER: Asking for 1 more
nodes(s) for a total of 1
14/09/17 07:02:47 INFO state.RoleHistory: There're 1 nodes to consider for
HBASE_REGIONSERVER
14/09/17 07:02:47 INFO state.OutstandingRequest: Submitting request for
container on c6403.ambari.apache.org
14/09/17 07:02:47 INFO state.AppState: Container ask is Capability[<memory:256,
vCores:1>]Priority[2]
14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_MASTER', key=1,
desired=1, actual=1, requested=0, releasing=0, failed=0, started=1,
startFailed=0, completed=0, failureMessage=''}
14/09/17 07:02:47 INFO util.RackResolver: Resolved c6403.ambari.apache.org to
/default-rack
{noformat}
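The two entries in this log that matter most for triage are the container completion (exitStatus=-100, which corresponds to ContainerExitStatus.ABORTED in YARN) and the host named in the follow-up request. The sketch below is a hypothetical helper, not part of Slider. It rejoins the wrapped lines and pulls out those fields, and its regular expressions are assumptions based only on the excerpt shown here.

```python
import re

# Excerpt copied from the log above, including the mail client's line wrapping.
LOG = """\
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for
containerID=container_1410935367006_0001_01_000002, state=COMPLETE,
exitStatus=-100, diagnostics=Container released on a *lost* node
14/09/17 07:02:47 INFO state.OutstandingRequest: Submitting request for
container on c6403.ambari.apache.org
"""

# A new log entry starts with a "yy/MM/dd HH:mm:ss " timestamp.
TIMESTAMP = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")


def unwrap(text):
    """Join wrapped continuation lines back onto the entry that starts them."""
    entries = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def summarize(text):
    """Collect (containerID, state, exitStatus) tuples and requested hosts."""
    completions, requested_hosts = [], []
    for entry in unwrap(text):
        done = re.search(
            r"containerID=(\S+), state=(\w+), exitStatus=(-?\d+)", entry)
        if done:
            completions.append(
                (done.group(1), done.group(2), int(done.group(3))))
        ask = re.search(r"Submitting request for container on (\S+)", entry)
        if ask:
            requested_hosts.append(ask.group(1))
    return completions, requested_hosts


completions, requested_hosts = summarize(LOG)
print(completions)
print(requested_hosts)
```

On this excerpt, the replacement request names c6403.ambari.apache.org, the same host that was just reported lost.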
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)