[ 
https://issues.apache.org/jira/browse/SLIDER-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086655#comment-14086655
 ] 

Steve Loughran commented on SLIDER-270:
---------------------------------------

...also occurs in branch-2.5; assume error in our code first

AM is releasing requested containers, but the _003 container isn't triggering a 
callback
{code}
2014-08-05 18:20:58,915 [IPC Server handler 0 on 64048] INFO  state.AppState 
(AppState.java:buildRoleRequirementsFromResources(663)) - Role worker flexed 
from 3 to 1
2014-08-05 18:20:58,915 [IPC Server handler 0 on 64048] WARN  state.AppState 
(AppState.java:buildRoleRequirementsFromResources(660)) - Role REST has 0 
instances specified
2014-08-05 18:20:58,916 [IPC Server handler 0 on 64048] DEBUG 
appmaster.SliderAppMaster 
(SliderAppMaster.java:reviewRequestAndReleaseNodes(1101)) - in 
reviewRequestAndReleaseNodes()
2014-08-05 18:20:58,916 [IPC Server handler 0 on 64048] DEBUG state.AppState 
(AppState.java:reviewRequestAndReleaseNodes(1369)) - in 
reviewRequestAndReleaseNodes()
2014-08-05 18:20:58,916 [IPC Server handler 0 on 64048] INFO  state.AppState 
(AppState.java:reviewOneRole(1419)) - RoleStatus{name='master', key=2, 
desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, 
startFailed=0, completed=0, failureMessage=''}
2014-08-05 18:20:58,916 [IPC Server handler 0 on 64048] INFO  state.AppState 
(AppState.java:reviewOneRole(1419)) - RoleStatus{name='worker', key=1, 
desired=1, actual=3, requested=0, releasing=0, failed=0, started=4, 
startFailed=0, completed=1, failureMessage=''}
2014-08-05 18:20:58,916 [IPC Server handler 0 on 64048] INFO  state.AppState 
(AppState.java:reviewOneRole(1441)) - worker: Asking for 2 fewer node(s) for a 
total of 1
2014-08-05 18:20:58,916 [IPC Server handler 0 on 64048] DEBUG state.NodeMap 
(NodeMap.java:findNodesForRelease(120)) - searching for 2 nodes with candidate 
set size 1
2014-08-05 18:20:58,916 [IPC Server handler 0 on 64048] DEBUG state.NodeMap 
(NodeMap.java:findNodesForRelease(126)) - Node 240.0.0.1 load=3
2014-08-05 18:20:58,917 [IPC Server handler 0 on 64048] DEBUG state.NodeMap 
(NodeMap.java:findNodesForRelease(135)) - Push 240.0.0.1 #0
2014-08-05 18:20:58,917 [IPC Server handler 0 on 64048] DEBUG state.NodeMap 
(NodeMap.java:findNodesForRelease(135)) - Push 240.0.0.1 #1
2014-08-05 18:20:58,917 [IPC Server handler 0 on 64048] INFO  state.AppState 
(AppState.java:reviewOneRole(1419)) - RoleStatus{name='REST', key=3, desired=0, 
actual=0, requested=0, releasing=0, failed=0, started=0, startFailed=0, 
completed=0, failureMessage=''}
2014-08-05 18:20:58,917 [IPC Server handler 0 on 64048] DEBUG 
appmaster.AsyncRMOperationHandler 
(AsyncRMOperationHandler.java:releaseAssignedContainer(42)) - Releasing 
container container_1407259223803_0001_01_000003
2014-08-05 18:20:58,917 [IPC Server handler 0 on 64048] DEBUG 
appmaster.AsyncRMOperationHandler 
(AsyncRMOperationHandler.java:releaseAssignedContainer(42)) - Releasing 
container container_1407259223803_0001_01_000004
2014-08-05 18:20:59,922 [AMRM Callback Handler Thread] INFO  
appmaster.SliderAppMaster (SliderAppMaster.java:onContainersCompleted(1045)) - 
onContainersCompleted([1]
2014-08-05 18:20:59,922 [AMRM Callback Handler Thread] INFO  
appmaster.SliderAppMaster (SliderAppMaster.java:onContainersCompleted(1048)) - 
Container Completion for containerID=container_1407259223803_0001_01_000004, 
state=COMPLETE, exitStatus=-100, diagnostics=Container released by application
2014-08-05 18:20:59,922 [AMRM Callback Handler Thread] INFO  state.AppState 
(AppState.java:onCompletedNode(1175)) - Container was queued for release
2014-08-05 18:20:59,922 [AMRM Callback Handler Thread] INFO  state.AppState 
(AppState.java:onCompletedNode(1178)) - decrementing role count for role worker
2014-08-05 18:20:59,928 [AMRM Callback Handler Thread] DEBUG 
appmaster.SliderAppMaster 
(SliderAppMaster.java:reviewRequestAndReleaseNodes(1101)) - in 
reviewRequestAndReleaseNodes()
2014-08-05 18:20:59,928 [AMRM Callback Handler Thread] DEBUG state.AppState 
(AppState.java:reviewRequestAndReleaseNodes(1369)) - in 
reviewRequestAndReleaseNodes()
2014-08-05 18:20:59,928 [AMRM Callback Handler Thread] INFO  state.AppState 
(AppState.java:reviewOneRole(1419)) - RoleStatus{name='master', key=2, 
desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, 
startFailed=0, completed=0, failureMessage=''}
2014-08-05 18:20:59,928 [AMRM Callback Handler Thread] INFO  state.AppState 
(AppState.java:reviewOneRole(1419)) - RoleStatus{name='worker', key=1, 
desired=1, actual=2, requested=0, releasing=1, failed=0, started=4, 
startFailed=0, completed=2, failureMessage=''}
2014-08-05 18:20:59,928 [AMRM Callback Handler Thread] INFO  state.AppState 
(AppState.java:reviewOneRole(1419)) - RoleStatus{name='REST', key=3, desired=0, 
actual=0, requested=0, releasing=0, failed=0, started=0, startFailed=0, 
completed=0, failureMessage=''}
{code}

because the RM doesn't know of it
{code}
2014-08-05 18:20:59,309 [ResourceManager Event Processor] INFO  
fifo.FifoScheduler (FifoScheduler.java:containerCompleted(837)) - Null 
container completed...
2014-08-05 18:20:59,920 [IPC Server handler 8 on 64035] WARN  
resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(209)) - 
USER=stevel  IP=240.0.0.OPERATION=AM Released Container      
TARGET=FifoScheduler    RESULT=FAILURE  DESCRIPTION=Trying to release container 
not owned by app or with invalid id     PERMISSIONS=Unauthorized access or 
invalid container    APPID=application_1407259223803_0001    
CONTAINERID=container_1407259223803_0001_01_000003
2014-08-05 18:20:59,920 [IPC Server handler 8 on 64035] INFO  
fifo.FifoScheduler (FifoScheduler.java:containerCompleted(837)) - Null 
container completed...
2014-08-05 18:20:59,920 [IPC Server handler 8 on 64035] INFO  
rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(376)) - 
container_1407259223803_0001_01_000004 Container Transitioned from RUNNING to 
RELEASED
2014-08-05 18:20:59,921 [IPC Server handler 8 on 64035] INFO  
fica.FiCaSchedulerApp (FiCaSchedulerApp.java:containerCompleted(95)) - 
Completed container: container_1407259223803_0001_01_000004 in state: RELEASED 
event:RELEASED
2014-08-05 18:20:59,921 [IPC Server handler 8 on 64035] INFO  
resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(100)) - 
USER=stevel  IP=240.0.0.OPERATION=AM Released Container      
TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1407259223803_0001    
CONTAINERID=container_1407259223803_0001_01_000004
2014-08-05 18:20:59,921 [IPC Server handler 8 on 64035] INFO  
scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(194)) - Released 
container container_1407259223803_0001_01_000004 of capacity <memory:256, 
vCores:1> on host 240.0.0.1:64039, which currently has 4 containers, 
<memory:1024, vCores:4> used and <memory:3072, vCores:4> available, release 
resources=true
2014-08-05 18:20:59,921 [IPC Server handler 8 on 64035] INFO  
fifo.FifoScheduler (FifoScheduler.java:containerCompleted(868)) - Application 
attempt appattempt_1407259223803_0001_000001 released container 
container_1407259223803_0001_01_000004 on node: host: 240.0.0.1:64039 
#containers=4 available=3072 used=1024 with event: RELEASED
2014-08-05 18:20:59,948 [Thread-1] DEBUG rpc.RpcBinder 
(RpcBinder.java:getProxy(227)) - Connecting to stevel-9.local/240.0.0.1:64048
2014-08-05 18:20:59,948 [Thread-1] DEBUG rpc.RpcBinder 
(RpcBinder.java:connectToServer(122)) - Connecting to Slider AM at 
stevel-9.local/240.0.0.1:64048
2014-08-05 18:20:59,971 [Thread-1] DEBUG test.SliderTestUtils 
(SliderTestUtils.groovy:waitForRoleCount(294)) - Waiting: [worker]: desired: 1; 
actual: 2 
2014-08-05 18:21:00,311 [AsyncDispatcher event handler] INFO  
container.Container (ContainerImpl.java:handle(918)) - Container 
container_1407259223803_0001_01_000004 transitioned from RUNNING to KILLING
2014-08-05 18:21:00,311 [AsyncDispatcher event handler] INFO  
launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(358)) - 
Cleaning up container container_1407259223803_0001_01_000004
2014-08-05 18:21:00,325 [ContainersLauncher #1] WARN  
nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:launchContainer(208)) - Exit code from container 
container_1407259223803_0001_01_000004 is : 143
2014-08-05 18:21:00,335 [AsyncDispatcher event handler] INFO  
container.Container (ContainerImpl.java:handle(918)) - Container 
container_1407259223803_0001_01_000004 transitioned from KILLING to 
CONTAINER_CLEANEDUP_AFTER_KILL
2014-08-05 18:21:00,336 [DeletionService #2] INFO  
nodemanager.DefaultContainerExecutor 
(DefaultContainerExecutor.java:deleteAsUser(381)) - Deleting absolute path : 
/Users/stevel/Projects/Hortonworks/Projects/slider/slider-providers/hbase/slider-hbase-provider/target/testclusterflexdownmultiple/testclusterflexdownmultiple-localDir-nm-0_0/usercache/stevel/appcache/application_1407259223803_0001/container_1407259223803_0001_01_000004
2014-08-05 18:21:00,336 [AsyncDispatcher event handler] INFO  
nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=stevel     
  OPERATION=Container Finished - Killed   TARGET=ContainerImpl    
RESULT=SUCCESS  APPID=application_1407259223803_0001    
CONTAINERID=container_1407259223803_0001_01_000004
2014-08-05 18:21:00,336 [AsyncDispatcher event handler] INFO  
container.Container (ContainerImpl.java:handle(918)) - Container 
container_1407259223803_0001_01_000004 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2014-08-05 18:21:00,337 [AsyncDispatcher event handler] INFO  
application.Application (ApplicationImpl.java:transition(341)) - Removing 
container_1407259223803_0001_01_000004 from application 
application_1407259223803_0001
2014-08-05 18:21:00,337 [AsyncDispatcher event handler] INFO  
containermanager.AuxServices (AuxServices.java:handle(196)) - Got event 
CONTAINER_STOP for appId application_1407259223803_0001
{code}

We know a full shutdown works once, so assume: the AM isn't getting/processing 
all the container release events and is out of sync with what is actually in 
the cluster.

> Calling flex (down) the second time does not work
> -------------------------------------------------
>
>                 Key: SLIDER-270
>                 URL: https://issues.apache.org/jira/browse/SLIDER-270
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>    Affects Versions: Slider 0.50
>            Reporter: Sumit Mohanty
>            Assignee: Steve Loughran
>             Fix For: Slider 0.50
>
>
> From AppMaster log (see below) it looks like that the second command to flex 
> from 2 to 1 did not result in container release.
> {noformat}
> 14/08/04 01:55:18 INFO state.AppState: Role MEMCACHED flexed from 3 to 2
> 14/08/04 01:55:18 INFO state.AppState: RoleStatus{name='MEMCACHED', key=1, 
> desired=2, actual=3, requested=0, releasing=0, failed=0, started=3, 
> startFailed=0, completed=0, failureMessage=''}
> 14/08/04 01:55:18 INFO state.AppState: MEMCACHED: Asking for 1 fewer node(s) 
> for a total of 2
> 14/08/04 01:55:19 INFO appmaster.SliderAppMaster: onContainersCompleted([1]
> 14/08/04 01:55:19 INFO appmaster.SliderAppMaster: Container Completion for 
> containerID=container_1405048900371_0054_01_000004, state=COMPLETE, 
> exitStatus=-100, diagnostics=Container released by application
> 14/08/04 01:55:19 INFO state.AppState: Container was queued for release
> 14/08/04 01:55:19 INFO state.AppState: decrementing role count for role 
> MEMCACHED
> 14/08/04 01:55:19 INFO agent.AgentProviderService: Removing container 
> specific data for container_1405048900371_0054_01_000004
> 14/08/04 01:55:19 INFO agent.AgentProviderService: publishing 
> PublishedConfiguration{description='ComponentInstanceData' entries = 2}
> 14/08/04 01:55:19 INFO state.AppState: RoleStatus{name='MEMCACHED', key=1, 
> desired=2, actual=2, requested=0, releasing=0, failed=0, started=3, 
> startFailed=0, completed=1, failureMessage=''}
> 14/08/04 01:55:45 INFO state.AppState: Role MEMCACHED flexed from 2 to 1
> 14/08/04 01:55:45 INFO state.AppState: RoleStatus{name='MEMCACHED', key=1, 
> desired=1, actual=2, requested=0, releasing=0, failed=0, started=3, 
> startFailed=0, completed=1, failureMessage=''}
> 14/08/04 01:55:45 INFO state.AppState: MEMCACHED: Asking for 1 fewer node(s) 
> for a total of 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to