[jira] [Commented] (SLIDER-439) RM never fulfilled Slider AM's container request after NM died on a node where HRegionServer was running

ASF subversion and git services (JIRA) Wed, 22 Oct 2014 21:26:12 -0700

    [ 
https://issues.apache.org/jira/browse/SLIDER-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181004#comment-14181004
 ]


ASF subversion and git services commented on SLIDER-439:
--------------------------------------------------------

Commit 34b909a8edb551b6e9aa7f5ab2b3f6bd04f1b7c5 in incubator-slider's branch 
refs/heads/develop from [~gsaha]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-slider.git;h=34b909a ]

SLIDER-439 RM never fulfilled Slider AM's container request after NM died on a 
node where HRegionServer was running


> RM never fulfilled Slider AM's container request after NM died on a node 
> where HRegionServer was running
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-439
>                 URL: https://issues.apache.org/jira/browse/SLIDER-439
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>            Priority: Critical
>             Fix For: Slider 0.60
>
>
> Steps to reproduce:
> - Set up a 3-node cluster (in non-HA mode)
> - Run slider create for HBase app-package (with HMaster and HRegionServer 
> components only - just to keep things simple)
> - Assume that the HRegionServer came up on a node different from the one 
> hosting HMaster and the Slider AM (if not, repeating destroy-create a couple 
> of times will get you to this setup)
> - Kill the NM on the node where HRegionServer is running
> - Wait for at least 10 minutes (do not restart NM on this node)
> - At this point the Slider AM had received the onNodesUpdated and 
> onContainersCompleted events from the RM, unregistered the container, and 
> requested a new one from the RM
> - This time the request for a new container never got fulfilled even after 
> waiting for several minutes
> Expected:
> - Given that absolutely nothing else was running on that cluster, the 
> container request should have been fulfilled by the RM
> Interesting observation:
> - After waiting long enough, I restarted the NM on the node where it had been 
> killed. Surprisingly, the new container request was fulfilled at that point, 
> and the container with HRegionServer came up on that same node. 
> It seemed as if the RM was waiting for the NM to come back up on this node 
> (affinity?) although it had marked the node dead a long time earlier.
> Here is the Slider AM log snippet from the time it received the 
> onNodesUpdated event:
> {noformat}
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Nodes updated
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: onContainersCompleted([1]
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for 
> containerID=container_1410935367006_0001_01_000002, state=COMPLETE, 
> exitStatus=-100, diagnostics=Container released on a *lost* node
> 14/09/17 07:02:47 INFO state.AppState: Failed container in role[2] : 
> HBASE_REGIONSERVER
> 14/09/17 07:02:47 INFO state.AppState: Current count of failed role[2] 
> HBASE_REGIONSERVER =  1
> 14/09/17 07:02:47 INFO state.AppState: Removing node ID 
> container_1410935367006_0001_01_000002
> 14/09/17 07:02:47 ERROR appmaster.SliderAppMaster: Role instance 
> RoleInstance{role='HBASE_REGIONSERVER', 
> id='container_1410935367006_0001_01_000002', 
> container=ContainerID=container_1410935367006_0001_01_000002 
> nodeID=c6403.ambari.apache.org:45454 http=c6403.ambari.apache.org:8042 
> priority=2, createTime=1410936271481, startTime=1410936271543, 
> released=false, roleId=2, host=c6403.ambari.apache.org, 
> hostURL=http://c6403.ambari.apache.org:8042, state=5, exitCode=-100, 
> command='python ./infra/agent/slider-agent/agent/main.py --label 
> container_1410935367006_0001_01_000002___HBASE_REGIONSERVER --zk-quorum 
> c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181,c6403.ambari.apache.org:2181
>  --zk-reg-path /registry/org-apache-slider/cl1 > <LOG_DIR>/agent.out 2>&1 ; 
> ', diagnostics='Container released on a *lost* node', output=null, 
> environment=[AGENT_WORK_ROOT="$PWD", HADOOP_USER_NAME="yarn", 
> AGENT_LOG_ROOT="$LOG_DIRS", PYTHONPATH="./infra/agent/slider-agent/", 
> SLIDER_PASSPHRASE="DEV"]} failed
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Unregistering component 
> container_1410935367006_0001_01_000002
> 14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_REGIONSERVER', 
> key=2, desired=1, actual=0, requested=0, releasing=0, failed=1, started=1, 
> startFailed=0, completed=0, failureMessage='Failure 
> container_1410935367006_0001_01_000002 on host c6403.ambari.apache.org: 
> http://c6402.ambari.apache.org:19888/jobhistory/logs/c6403.ambari.apache.org:45454/container_1410935367006_0001_01_000002/ctx/yarn'}
> 14/09/17 07:02:47 INFO state.AppState: HBASE_REGIONSERVER: Asking for 1 more 
> nodes(s) for a total of 1
> 14/09/17 07:02:47 INFO state.RoleHistory: There're 1 nodes to consider for 
> HBASE_REGIONSERVER
> 14/09/17 07:02:47 INFO state.OutstandingRequest: Submitting request for 
> container on c6403.ambari.apache.org
> 14/09/17 07:02:47 INFO state.AppState: Container ask is 
> Capability[<memory:256, vCores:1>]Priority[2]
> 14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_MASTER', key=1, 
> desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, 
> startFailed=0, completed=0, failureMessage=''}
> 14/09/17 07:02:47 INFO util.RackResolver: Resolved c6403.ambari.apache.org to 
> /default-rack
> {noformat}
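
For readers triaging similar logs, the container completion line above can be picked apart mechanically. The following is a minimal Python sketch, not part of Slider: the function name, the regular expression, and the SAMPLE constant (the wrapped log line from the snippet joined back into one line) are illustrative assumptions. The mapping of -100 to ABORTED is assumed to mirror YARN's ContainerExitStatus constants.

```python
import re

# Assumption: these values mirror YARN's ContainerExitStatus constants.
EXIT_STATUS_NAMES = {
    -100: "ABORTED",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
}

# Hypothetical pattern for the "Container Completion" line in the AM log.
COMPLETION_RE = re.compile(
    r"containerID=(?P<container>container_\w+),\s*"
    r"state=(?P<state>\w+),\s*"
    r"exitStatus=(?P<exit>-?\d+),\s*"
    r"diagnostics=(?P<diag>.*)"
)

# The wrapped log line from the snippet, joined into a single line.
SAMPLE = (
    "14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for "
    "containerID=container_1410935367006_0001_01_000002, state=COMPLETE, "
    "exitStatus=-100, diagnostics=Container released on a *lost* node"
)


def parse_completion(line):
    """Return the fields of a completion line as a dict, or None if no match."""
    match = COMPLETION_RE.search(line)
    if match is None:
        return None
    exit_status = int(match.group("exit"))
    diagnostics = match.group("diag").strip()
    return {
        "container": match.group("container"),
        "state": match.group("state"),
        "exit_status": exit_status,
        "exit_name": EXIT_STATUS_NAMES.get(exit_status, "UNKNOWN"),
        "diagnostics": diagnostics,
        # The RM's diagnostics text marks containers released on a lost node.
        "node_lost": "*lost* node" in diagnostics,
    }


if __name__ == "__main__":
    print(parse_completion(SAMPLE))
```

A record with node_lost set and an exit status of -100 indicates that the container ended because its node went away, not because the HRegionServer process itself failed.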

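The "affinity?" observation and the log line "Submitting request for container on c6403.ambari.apache.org" are consistent with a request pinned to the lost host. The toy model below is only a sketch under that assumption: it is not the YARN scheduler's algorithm, and ContainerAsk, relax_locality, and place are hypothetical names chosen for illustration. It shows why a host-pinned ask with no fallback stays pending until the node returns, while a relaxed ask can land elsewhere.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ContainerAsk:
    """Hypothetical container request: a preferred host plus a fallback flag."""
    host: Optional[str]
    relax_locality: bool


def place(ask: ContainerAsk, live_nodes: Set[str]) -> Optional[str]:
    """Return the node chosen for the ask, or None if it must keep waiting."""
    if ask.host is None:
        # No placement preference: any live node will do.
        return min(live_nodes) if live_nodes else None
    if ask.host in live_nodes:
        return ask.host
    if ask.relax_locality and live_nodes:
        # Preferred host is unavailable, but falling back is allowed.
        return min(live_nodes)
    # Strict ask for a host that is not live: the request stays pending.
    return None


if __name__ == "__main__":
    lost = "c6403.ambari.apache.org"
    live = {"c6401.ambari.apache.org", "c6402.ambari.apache.org"}

    strict = ContainerAsk(host=lost, relax_locality=False)
    relaxed = ContainerAsk(host=lost, relax_locality=True)

    print(place(strict, live))           # pending while the NM is down
    print(place(strict, live | {lost}))  # placed once the NM is back
    print(place(relaxed, live))          # placed on another live node
```

Under this model the strict ask reproduces the reported behaviour: nothing is allocated while the NM on c6403 is down, and the container appears on c6403 as soon as the NM is restarted.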


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
