Thanks! That was helpful. Strangely, as it turns out, the container is released (and cleaned up) even before the STOP command is queued. Some more logs:
Node Manager:
---------------------
2016-07-07 15:50:14,148 [AmExecutor-006] INFO state.AppState - Role ConnectD flexed from 2 to 1
2016-07-07 15:50:14,148 [AmExecutor-006] WARN state.AppState - Resource requirements of ConnectD normalized from <memory:2000, vCores:1> to <memory:2048, vCores:1>
2016-07-07 15:50:14,148 [AmExecutor-006] INFO state.AppState - Resetting failure count of slider-appmaster; was 0
2016-07-07 15:50:14,148 [AmExecutor-006] INFO state.AppState - Resetting failure count of ConnectD; was 0
2016-07-07 15:50:14,148 [AmExecutor-006] INFO state.RoleHistory - Resetting failure history
2016-07-07 15:50:14,148 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=2, requested=0, releasing=0, failed=0, startFailed=0, started=2, completed=0, totalRequested=2, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=30, labelExpression='null'}} :
2016-07-07 15:50:14,148 [AmExecutor-006] INFO state.AppState - ConnectD: Asking for 1 fewer node(s) for a total of 1
2016-07-07 15:50:14,150 [AmExecutor-006] INFO state.AppState - Targeting for release: RoleInstance{role='ConnectD', id='container_1467829690678_0031_01_000003', container=ContainerID=container_1467829690678_0031_01_000003 nodeID=192.168.1.195:58591 http=192.168.1.195:8042 priority=1073741825 resource=<memory:2048, vCores:1>, createTime=1467931727650, startTime=1467931727656, released=false, roleId=1, host=192.168.1.195, hostURL=http://192.168.1.195:8042, state=3, placement=null, exitCode=0, command='python ./infra/agent/slider-agent/agent/main.py --label container_1467829690678_0031_01_000003___ConnectD --zk-quorum localhost:2181 --zk-reg-path /registry/users/sarthakk/services/org-apache-slider/kc > <LOG_DIR>/slider-agent.out 2>&1 ; ', diagnostics='null', output=null, environment=[LANGUAGE="en_US.UTF-8", HADOOP_USER_NAME="sarthakk", PYTHONPATH="./infra/agent/slider-agent/", AGENT_LOG_ROOT="<LOG_DIR>", SLIDER_PASSPHRASE="QhURcbCISl0zGwdgkbpbIBKAlIJOmsgKkRgmvuq5ytVDlguFa0", LC_ALL="en_US.UTF-8", AGENT_WORK_ROOT="$PWD", LANG="en_US.UTF-8"]}
2016-07-07 15:50:14,975 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - onContainersCompleted([1]
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO appmaster.SliderAppMaster - Container Completion for containerID=container_1467829690678_0031_01_000003, state=COMPLETE, exitStatus=-100, diagnostics=Container released by application
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO state.AppState - Container was queued for release : container_1467829690678_0031_01_000003
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO state.AppState - decrementing role count for role ConnectD to 1; releasing=0, completed=1
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO state.RoleHistory - Finished container for node 1, released=true, shortlived=false
2016-07-07 15:50:14,987 [AMRM Callback Handler Thread] INFO state.AppState - Removing node ID container_1467829690678_0031_01_000003
2016-07-07 15:50:14,988 [AMRM Callback Handler Thread] INFO agent.AgentProviderService - Removing component status for label container_1467829690678_0031_01_000003___ConnectD
2016-07-07 15:50:14,988 [AmExecutor-006] INFO appmaster.SliderAppMaster - Unregistering component container_1467829690678_0031_01_000003
2016-07-07 15:50:14,989 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=1, requested=0, releasing=0, failed=0, startFailed=0, started=2, completed=1, totalRequested=2, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=30, labelExpression='null'}} :
2016-07-07 15:50:23,715 [430377746@qtp-1726515679-4] INFO agent.AgentProviderService - Sending terminate signal to completed container (still heartbeating): container_1467829690678_0031_01_000003___ConnectD
2016-07-07 16:12:43,891 [AmExecutor-006] INFO state.AppState - Resetting failure count of slider-appmaster; was 0
2016-07-07 16:12:43,892 [AmExecutor-006] INFO state.AppState - Resetting failure count of ConnectD; was 0
2016-07-07 16:12:43,892 [AmExecutor-006] INFO state.RoleHistory - Resetting failure history

slider-agent Logs:
---------------------
INFO 2016-07-07 15:50:23,720 ActionQueue.py:170 - Running command: {u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName': u'ConnectD', u'hostname': u'192.168.1.195', u'hostLevelParams': {u'java_home': u'${JAVA_HOME}', u'container_id': u'container_1467829690678_0031_01_000003'}, u'commandType': u'EXECUTION_COMMAND', u'roleParams': {u'auto_restart': u'false'}, u'serviceName': u'kc', u'role': u'ConnectD', u'commandParams': {u'record_config': u'true', u'service_package_folder': u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script': u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout': u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId': u'4-1', u'containers': [], u'configurations': {u'global': {u'security_enabled': u'false', u'app_container_id': u'container_1467829690678_0031_01_000003', u'listen_port': u'51670', u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir': u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir': u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2', u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid', u'app_install_dir': u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk', u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'}, u'server': {}}}
INFO 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:114 - Base dir: /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package
ERROR 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:169 - Caught an exception while executing command: <class 'AgentException.AgentException'>: 'Script /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py does not exist'
Traceback (most recent call last):
  File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py", line 115, in runCommand
    script_path = self.resolve_script_path(self.base_dir, script, script_type)
  File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py", line 199, in resolve_script_path
    raise AgentException(message)
AgentException: 'Script /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py does not exist'
INFO 2016-07-07 15:50:23,721 ActionQueue.py:188 - Stop command received
INFO 2016-07-07 15:50:23,824 Controller.py:223 - Terminate agent command received from AM, stopping the agent ...
INFO 2016-07-07 15:50:23,824 ProcessHelper.py:39 - Removing pid file
WARNING 2016-07-07 15:50:23,825 ProcessHelper.py:44 - Unable to remove pid file: [Errno 2] No such file or directory: '/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/infra/run/agent.pid'
INFO 2016-07-07 15:50:23,825 ProcessHelper.py:46 - Removing temp files

TL;DR: The callback signaling completion (and unregistration) of the container arrives around 15:50:14, whereas the call to CustomServiceOrchestrator (triggered by the terminate signal from the Slider AM) is only received around 15:50:23. I guess I am missing how the synchronization between termination of a container and cleanup of all its resources is handled by Slider (or whether this is handled by YARN itself).
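From the traceback, the failing check appears to boil down to a plain existence test against the container-local script path. A minimal sketch of what I infer from the stack trace (my reading, not the actual slider-agent source; script_type handling omitted):

import os

class AgentException(Exception):
    pass

def resolve_script_path(base_dir, script):
    # For the STOP command, base_dir is the container-specific package dir,
    # e.g. .../appcache/<app_id>/<container_id>/app/definition/package,
    # so once the NM deletes the container directory this can only fail.
    path = os.path.join(base_dir, script)  # script is 'scripts/kafka.py'
    if not os.path.isfile(path):
        raise AgentException("Script %s does not exist" % path)
    return path

If that reading is right, any STOP handled after the container directory is deleted would fail this way, even though kafka.py is still present in the shared filecache.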
- Sarthak

On Thu, Jul 7, 2016 at 2:48 PM, Billie Rinaldi <billie.rina...@gmail.com> wrote:
> If you look for the container ID in the nodemanager log on the host where
> the container was running, you should be able to see when the container
> stopped and was cleaned up. Looks like it even logs when it deletes the
> container directories.
>
> On Thu, Jul 7, 2016 at 2:04 PM, Sarthak Kukreti <skuk...@ncsu.edu> wrote:
>
>> kafka.py is still present in the filecache directory; it's just the
>> "container_1467829690678_0022_01_000003" directory that seems to be
>> deleted before the runCommand() call.
>>
>> - Sarthak
>>
>> On Thu, Jul 7, 2016 at 12:35 PM, Billie Rinaldi
>> <billie.rina...@gmail.com> wrote:
>> > I think that
>> > /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition
>> > is linked to
>> > /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip,
>> > so those are actually the same directory. I am not sure why it is saying
>> > kafka.py does not exist when stopping the container; it definitely should
>> > not clean up that directory while a container is still running. Can you
>> > verify that app/definition/package/scripts/kafka.py exists for one of the
>> > containers that is running?
>> >
>> > On Thu, Jul 7, 2016 at 11:50 AM, Sarthak Kukreti <skuk...@ncsu.edu> wrote:
>> >
>> >> Hello!
>> >>
>> >> I am trying to use Slider to distribute an application over a YARN
>> >> cluster. While attempting to use "slider flex" to decrease the number
>> >> of containers allocated for the application (using the kafka
>> >> app-package as reference), I came across the following error:
>> >>
>> >> ERROR 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:169 - Caught an exception while executing command: <class 'AgentException.AgentException'>: 'Script /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py does not exist'
>> >> Traceback (most recent call last):
>> >>   File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py", line 115, in runCommand
>> >>     script_path = self.resolve_script_path(self.base_dir, script, script_type)
>> >>   File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py", line 199, in resolve_script_path
>> >>     raise AgentException(message)
>> >> AgentException: 'Script /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py does not exist'
>> >>
>> >> (It seems like the directory is cleaned up before the command runs.)
>> >>
>> >> Additionally, I tried adding debug prints in the
>> >> CustomServiceOrchestrator to see what base directory is used for
>> >> invoking the script, and found that the base directories for the STATUS
>> >> and STOP commands differ:
>> >>
>> >> STATUS command:
>> >>
>> >> INFO 2016-07-07 10:56:31,323 AgentToggleLogger.py:40 - Adding STATUS_COMMAND for service kc of cluster kc to the queue.
>> >> INFO 2016-07-07 10:56:31,327 CustomServiceOrchestrator.py:114 - Base dir: /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip/package
>> >>
>> >> STOP command:
>> >>
>> >> INFO 2016-07-07 10:57:36,455 AgentToggleLogger.py:40 - Adding EXECUTION_COMMAND for service kc of cluster kc to the queue.
>> >> INFO 2016-07-07 10:57:36,456 Controller.py:251 - Attempting to gracefully stop the application ...
>> >> INFO 2016-07-07 10:57:36,458 ActionQueue.py:134 - Package received:
>> >> INFO 2016-07-07 10:57:36,458 ActionQueue.py:140 - Executing command with id = 4-1 for role = Hello of cluster kc
>> >> INFO 2016-07-07 10:57:36,460 ActionQueue.py:170 - Running command: {u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName': u'Hello', u'hostname': u'192.168.1.195', u'hostLevelParams': {u'java_home': u'/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home/', u'container_id': u'container_1467829690678_0022_01_000003'}, u'commandType': u'EXECUTION_COMMAND', u'roleParams': {u'auto_restart': u'false'}, u'serviceName': u'kc', u'role': u'Hello', u'commandParams': {u'record_config': u'true', u'service_package_folder': u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script': u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout': u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId': u'4-1', u'containers': [], u'configurations': {u'global': {u'security_enabled': u'false', u'app_container_id': u'container_1467829690678_0022_01_000003', u'listen_port': u'52508', u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir': u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir': u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2', u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid', u'app_install_dir': u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk', u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'}, u'server': {}}}
>> >> INFO 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:114 - Base dir: /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package
>> >>
>> >> For some reason, the STOP command attempts to pick up the script from
>> >> the container-specific location, whereas the STATUS command goes through
>> >> an entirely different path (though I am not sure whether this is the
>> >> cause of the issue). Any more pointers for debugging this would be
>> >> really helpful.
>> >>
>> >> (For reference, platform: OS X, Python 2.7.11)
>> >>
>> >> Thank you
>> >> Sarthak
>> >>
>>