Gour Saha created SLIDER-439:
--------------------------------

             Summary: RM never fulfilled Slider AM's container request after NM 
died on a node where HRegionServer was running
                 Key: SLIDER-439
                 URL: https://issues.apache.org/jira/browse/SLIDER-439
             Project: Slider
          Issue Type: Bug
          Components: appmaster
            Reporter: Gour Saha


Steps to reproduce:
- Set up a 3-node cluster (in non-HA mode)
- Run slider create for the HBase app-package (with only the HMaster and 
HRegionServer components, to keep things simple)
- Assume that the HRegionServer came up on a node different from that of 
HMaster and Slider AM (if not, doing destroy-create a couple of times will 
definitely get you to this setup)
- Kill the NM on the node where HRegionServer is running
- Wait for at least 10 minutes (do not restart NM on this node)
- At this point Slider AM received the onNodesUpdated and onContainersCompleted 
events from RM, unregistered the container, and requested a new one from RM
- This time the request for a new container was never fulfilled, even after 
waiting for several minutes

Expected:
- Given that there was absolutely nothing else running on that cluster, the 
container request should have been fulfilled by RM

Interesting observation:
- After waiting long enough, I restarted the NM on the node where it had been 
killed. Surprisingly, the new container request was fulfilled at that point, 
and the container with HRegionServer came up on that same node. It seemed 
as if RM was waiting for the NM to come back up on this node (affinity?), 
although it had marked the node dead a long time earlier.
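
One way to picture the suspected behaviour is the toy model below. It is only 
a sketch under an assumption the report does not confirm: that the replacement 
request was pinned to the lost host with no fallback to other nodes. The 
ContainerRequest class and place() function are hypothetical illustrations, 
not YARN or Slider code (YARN's own AMRMClient.ContainerRequest does accept a 
relaxLocality flag, which is what relax_locality stands in for here).

```python
from dataclasses import dataclass


@dataclass
class ContainerRequest:
    """Hypothetical stand-in for a container ask; not a YARN class."""
    host: str             # node the request is pinned to
    relax_locality: bool  # may the scheduler fall back to other nodes?


def place(request, live_nodes):
    """Return the node chosen for the request, or None if it stays pending."""
    if request.host in live_nodes:
        return request.host
    if request.relax_locality and live_nodes:
        # Fall back to any live node (sorted only to make the choice deterministic).
        return sorted(live_nodes)[0]
    # Strict request for a node that is not alive: nothing can be allocated.
    return None


# The NM on c6403 has been killed, so only two nodes are alive.
live = {"c6401.ambari.apache.org", "c6402.ambari.apache.org"}
strict = ContainerRequest("c6403.ambari.apache.org", relax_locality=False)
relaxed = ContainerRequest("c6403.ambari.apache.org", relax_locality=True)

pending = place(strict, live)    # stays unfulfilled while the node is down
fallback = place(relaxed, live)  # would land on another live node
after_restart = place(strict, live | {"c6403.ambari.apache.org"})
```

In this model the strict request stays pending until c6403 rejoins the set of 
live nodes, which matches the observation above. Whether that is the actual 
cause still needs to be verified against the AM and RM code.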


Here is the Slider AM log snippet, starting from the time it receives the 
onNodesUpdated event:
{noformat}
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Nodes updated
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: onContainersCompleted([1]
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for 
containerID=container_1410935367006_0001_01_000002, state=COMPLETE, 
exitStatus=-100, diagnostics=Container released on a *lost* node
14/09/17 07:02:47 INFO state.AppState: Failed container in role[2] : 
HBASE_REGIONSERVER
14/09/17 07:02:47 INFO state.AppState: Current count of failed role[2] 
HBASE_REGIONSERVER =  1
14/09/17 07:02:47 INFO state.AppState: Removing node ID 
container_1410935367006_0001_01_000002
14/09/17 07:02:47 ERROR appmaster.SliderAppMaster: Role instance 
RoleInstance{role='HBASE_REGIONSERVER', 
id='container_1410935367006_0001_01_000002', 
container=ContainerID=container_1410935367006_0001_01_000002 
nodeID=c6403.ambari.apache.org:45454 http=c6403.ambari.apache.org:8042 
priority=2, createTime=1410936271481, startTime=1410936271543, released=false, 
roleId=2, host=c6403.ambari.apache.org, 
hostURL=http://c6403.ambari.apache.org:8042, state=5, exitCode=-100, 
command='python ./infra/agent/slider-agent/agent/main.py --label 
container_1410935367006_0001_01_000002___HBASE_REGIONSERVER --zk-quorum 
c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181,c6403.ambari.apache.org:2181
 --zk-reg-path /registry/org-apache-slider/cl1 > <LOG_DIR>/agent.out 2>&1 ; ', 
diagnostics='Container released on a *lost* node', output=null, 
environment=[AGENT_WORK_ROOT="$PWD", HADOOP_USER_NAME="yarn", 
AGENT_LOG_ROOT="$LOG_DIRS", PYTHONPATH="./infra/agent/slider-agent/", 
SLIDER_PASSPHRASE="DEV"]} failed
14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Unregistering component 
container_1410935367006_0001_01_000002
14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_REGIONSERVER', 
key=2, desired=1, actual=0, requested=0, releasing=0, failed=1, started=1, 
startFailed=0, completed=0, failureMessage='Failure 
container_1410935367006_0001_01_000002 on host c6403.ambari.apache.org: 
http://c6402.ambari.apache.org:19888/jobhistory/logs/c6403.ambari.apache.org:45454/container_1410935367006_0001_01_000002/ctx/yarn'}
14/09/17 07:02:47 INFO state.AppState: HBASE_REGIONSERVER: Asking for 1 more 
nodes(s) for a total of 1
14/09/17 07:02:47 INFO state.RoleHistory: There're 1 nodes to consider for 
HBASE_REGIONSERVER
14/09/17 07:02:47 INFO state.OutstandingRequest: Submitting request for 
container on c6403.ambari.apache.org
14/09/17 07:02:47 INFO state.AppState: Container ask is Capability[<memory:256, 
vCores:1>]Priority[2]
14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_MASTER', key=1, 
desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, 
startFailed=0, completed=0, failureMessage=''}
14/09/17 07:02:47 INFO util.RackResolver: Resolved c6403.ambari.apache.org to 
/default-rack
{noformat}
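
For anyone triaging similar logs, the following is a minimal sketch of pulling 
the key fields out of the "Container Completion" line above. The regular 
expression and the parse_completion() helper are illustrative assumptions 
based only on the line format shown in this snippet. The mapping of -100 to 
ABORTED follows YARN's ContainerExitStatus constants.

```python
import re

# The wrapped "Container Completion" log lines, rejoined into one string.
line = (
    "14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for "
    "containerID=container_1410935367006_0001_01_000002, state=COMPLETE, "
    "exitStatus=-100, diagnostics=Container released on a *lost* node"
)

# Pattern inferred from the single line above; other log formats may differ.
PATTERN = re.compile(
    r"containerID=(?P<container>[^,\s]+), state=(?P<state>\w+), "
    r"exitStatus=(?P<exit>-?\d+), diagnostics=(?P<diag>.*)$"
)

# -100 is ContainerExitStatus.ABORTED in YARN.
EXIT_NAMES = {-100: "ABORTED"}


def parse_completion(text):
    """Return the completion fields as a dict, or None if the line does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    exit_status = int(match.group("exit"))
    return {
        "container": match.group("container"),
        "state": match.group("state"),
        "exit_status": exit_status,
        "exit_name": EXIT_NAMES.get(exit_status, "UNKNOWN"),
        "diagnostics": match.group("diag"),
    }


result = parse_completion(line)
print(result)
```

The diagnostics text, together with exit status -100, indicates that the 
container ended because its node was lost and not because the HRegionServer 
process itself failed.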



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
