[
https://issues.apache.org/jira/browse/SLIDER-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138737#comment-14138737
]
Steve Loughran commented on SLIDER-439:
---------------------------------------
This looks like a YARN quirk. Do you want to file a JIRA there?
> RM never fulfilled Slider AM's container request after NM died on a node
> where HRegionServer was running
> --------------------------------------------------------------------------------------------------------
>
> Key: SLIDER-439
> URL: https://issues.apache.org/jira/browse/SLIDER-439
> Project: Slider
> Issue Type: Bug
> Components: appmaster
> Reporter: Gour Saha
> Assignee: Steve Loughran
>
> Steps to reproduce:
> - Set up a 3-node cluster (in non-HA mode)
> - Run slider create for the HBase app-package (with only the HMaster and
> HRegionServer components, to keep things simple)
> - Assume that the HRegionServer came up on a different node from the HMaster
> and the Slider AM (if not, repeating destroy and create a couple of times
> will definitely get you to this setup)
> - Kill the NM on the node where the HRegionServer is running
> - Wait for at least 10 minutes (do not restart the NM on this node)
> - At this point the Slider AM has received the onNodesUpdated and
> onContainersCompleted events from the RM, unregistered the container, and
> requested a new one from the RM
> - This time the request for a new container was never fulfilled, even after
> waiting for several minutes
> Expected:
> - Given that absolutely nothing else was running on that cluster, the
> container request should have been fulfilled by the RM
> Interesting observation:
> - After waiting long enough, I restarted the NM on the node where it had been
> killed. Surprisingly, the new container request was fulfilled at that point,
> and the container with the HRegionServer came up on that same node.
> It seemed as if the RM was waiting for the NM to come back up on this node
> (affinity?) although it had marked the node dead a long time earlier.
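One reading of the observation above, which this thread does not confirm, is that the replacement request was tied to the lost host with no fallback to other nodes. The sketch below is a toy model of that assumption only. `PlacementSketch`, `Request`, and `place` are hypothetical names, and the code is neither the YARN scheduler nor Slider's request logic.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.SortedSet;
import java.util.TreeSet;

public class PlacementSketch {

    /** Hypothetical container request: a target host and whether other hosts are acceptable. */
    static final class Request {
        final String host;
        final boolean relaxLocality;

        Request(String host, boolean relaxLocality) {
            this.host = host;
            this.relaxLocality = relaxLocality;
        }
    }

    /** Returns the node that would receive the container, or empty if the request must wait. */
    static Optional<String> place(Request request, SortedSet<String> liveNodes) {
        if (liveNodes.contains(request.host)) {
            return Optional.of(request.host);
        }
        // Only a relaxed request may fall back to some other live node.
        if (request.relaxLocality && !liveNodes.isEmpty()) {
            return Optional.of(liveNodes.first());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Two surviving nodes; the NM on c6403 has been killed.
        SortedSet<String> live = new TreeSet<>(Arrays.asList(
                "c6401.ambari.apache.org", "c6402.ambari.apache.org"));
        Request strict = new Request("c6403.ambari.apache.org", false);
        Request relaxed = new Request("c6403.ambari.apache.org", true);

        System.out.println("strict, NM down:  " + place(strict, live));
        System.out.println("relaxed, NM down: " + place(relaxed, live));

        live.add("c6403.ambari.apache.org"); // NM restarted on the original node
        System.out.println("strict, NM back:  " + place(strict, live));
    }
}
```

Under this assumption the strict request stays pending while the NM is down and is satisfied on the original node once the NM returns, which matches the reported behaviour. It does not show that this is what the RM actually did.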
> Here is the Slider AM log snippet from the time it receives the
> onNodesUpdated event -
> {noformat}
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Nodes updated
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: onContainersCompleted([1]
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for
> containerID=container_1410935367006_0001_01_000002, state=COMPLETE,
> exitStatus=-100, diagnostics=Container released on a *lost* node
> 14/09/17 07:02:47 INFO state.AppState: Failed container in role[2] :
> HBASE_REGIONSERVER
> 14/09/17 07:02:47 INFO state.AppState: Current count of failed role[2]
> HBASE_REGIONSERVER = 1
> 14/09/17 07:02:47 INFO state.AppState: Removing node ID
> container_1410935367006_0001_01_000002
> 14/09/17 07:02:47 ERROR appmaster.SliderAppMaster: Role instance
> RoleInstance{role='HBASE_REGIONSERVER',
> id='container_1410935367006_0001_01_000002',
> container=ContainerID=container_1410935367006_0001_01_000002
> nodeID=c6403.ambari.apache.org:45454 http=c6403.ambari.apache.org:8042
> priority=2, createTime=1410936271481, startTime=1410936271543,
> released=false, roleId=2, host=c6403.ambari.apache.org,
> hostURL=http://c6403.ambari.apache.org:8042, state=5, exitCode=-100,
> command='python ./infra/agent/slider-agent/agent/main.py --label
> container_1410935367006_0001_01_000002___HBASE_REGIONSERVER --zk-quorum
> c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181,c6403.ambari.apache.org:2181
> --zk-reg-path /registry/org-apache-slider/cl1 > <LOG_DIR>/agent.out 2>&1 ;
> ', diagnostics='Container released on a *lost* node', output=null,
> environment=[AGENT_WORK_ROOT="$PWD", HADOOP_USER_NAME="yarn",
> AGENT_LOG_ROOT="$LOG_DIRS", PYTHONPATH="./infra/agent/slider-agent/",
> SLIDER_PASSPHRASE="DEV"]} failed
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Unregistering component
> container_1410935367006_0001_01_000002
> 14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_REGIONSERVER',
> key=2, desired=1, actual=0, requested=0, releasing=0, failed=1, started=1,
> startFailed=0, completed=0, failureMessage='Failure
> container_1410935367006_0001_01_000002 on host c6403.ambari.apache.org:
> http://c6402.ambari.apache.org:19888/jobhistory/logs/c6403.ambari.apache.org:45454/container_1410935367006_0001_01_000002/ctx/yarn'}
> 14/09/17 07:02:47 INFO state.AppState: HBASE_REGIONSERVER: Asking for 1 more
> nodes(s) for a total of 1
> 14/09/17 07:02:47 INFO state.RoleHistory: There're 1 nodes to consider for
> HBASE_REGIONSERVER
> 14/09/17 07:02:47 INFO state.OutstandingRequest: Submitting request for
> container on c6403.ambari.apache.org
> 14/09/17 07:02:47 INFO state.AppState: Container ask is
> Capability[<memory:256, vCores:1>]Priority[2]
> 14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_MASTER', key=1,
> desired=1, actual=1, requested=0, releasing=0, failed=0, started=1,
> startFailed=0, completed=0, failureMessage=''}
> 14/09/17 07:02:47 INFO util.RackResolver: Resolved c6403.ambari.apache.org to
> /default-rack
> {noformat}
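For anyone triaging similar logs, here is a minimal sketch of pulling the exit status out of the "Container Completion" line quoted above. It assumes the wrapped log line has been joined back into a single string. `CompletionLogSketch` and its methods are hypothetical helpers, and the `ABORTED` constant is a local copy of the value of `ContainerExitStatus.ABORTED` in the YARN API, not an import from Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompletionLogSketch {

    // Local mirror of YARN's ContainerExitStatus.ABORTED (assumed value, not imported).
    static final int ABORTED = -100;

    // Matches the fields printed in the AM's "Container Completion" message.
    private static final Pattern COMPLETION = Pattern.compile(
            "containerID=(\\S+?),\\s*state=(\\w+),\\s*exitStatus=(-?\\d+),\\s*diagnostics=(.*)");

    /** Extracts the numeric exit status from a joined completion log line. */
    static int exitStatus(String logLine) {
        Matcher m = COMPLETION.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a container completion line: " + logLine);
        }
        return Integer.parseInt(m.group(3));
    }

    /** True when the container was aborted and the diagnostics mention a lost node. */
    static boolean lostWithNode(String logLine) {
        return exitStatus(logLine) == ABORTED && logLine.contains("lost");
    }

    public static void main(String[] args) {
        String line = "14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for "
                + "containerID=container_1410935367006_0001_01_000002, state=COMPLETE, "
                + "exitStatus=-100, diagnostics=Container released on a *lost* node";
        System.out.println("exitStatus   = " + exitStatus(line));
        System.out.println("lostWithNode = " + lostWithNode(line));
    }
}
```

An exit status of -100 together with the "lost node" diagnostics indicates that the container ended because its node went away, not because the HRegionServer process itself failed.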