Thanks! That was helpful. Strangely, as it turns out, the container is
released (and cleaned up) even before the STOP command is queued.
Some more logs:

Slider AppMaster:
---------------------
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState - Role
ConnectD flexed from 2 to 1
2016-07-07 15:50:14,148 [AmExecutor-006] WARN  state.AppState -
Resource requirements of ConnectD normalized from <memory:2000,
vCores:1> to <memory:2048, vCores:1>
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of slider-appmaster; was 0
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of ConnectD; was 0
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.RoleHistory -
Resetting failure history
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=2,
requested=0, releasing=0, failed=0, startFailed=0, started=2,
completed=0, totalRequested=2, preempted=0, nodeFailed=0,
failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048,
vCores:1>, isAntiAffinePlacement=false, failureMessage='',
providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1,
nodeFailureThreshold=3, placementTimeoutSeconds=30,
labelExpression='null'}} :
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
ConnectD: Asking for 1 fewer node(s) for a total of 1
2016-07-07 15:50:14,150 [AmExecutor-006] INFO  state.AppState -
Targeting for release: RoleInstance{role='ConnectD',
id='container_1467829690678_0031_01_000003',
container=ContainerID=container_1467829690678_0031_01_000003
nodeID=192.168.1.195:58591 http=192.168.1.195:8042 priority=1073741825
resource=<memory:2048, vCores:1>, createTime=1467931727650,
startTime=1467931727656, released=false, roleId=1, host=192.168.1.195,
hostURL=http://192.168.1.195:8042, state=3, placement=null,
exitCode=0, command='python ./infra/agent/slider-agent/agent/main.py
--label container_1467829690678_0031_01_000003___ConnectD --zk-quorum
localhost:2181 --zk-reg-path
/registry/users/sarthakk/services/org-apache-slider/kc >
<LOG_DIR>/slider-agent.out 2>&1 ; ', diagnostics='null', output=null,
environment=[LANGUAGE="en_US.UTF-8", HADOOP_USER_NAME="sarthakk",
PYTHONPATH="./infra/agent/slider-agent/", AGENT_LOG_ROOT="<LOG_DIR>",
SLIDER_PASSPHRASE="QhURcbCISl0zGwdgkbpbIBKAlIJOmsgKkRgmvuq5ytVDlguFa0",
LC_ALL="en_US.UTF-8", AGENT_WORK_ROOT="$PWD", LANG="en_US.UTF-8"]}

2016-07-07 15:50:14,975 [AMRM Callback Handler Thread] INFO
appmaster.SliderAppMaster - onContainersCompleted([1]
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
appmaster.SliderAppMaster - Container Completion for
containerID=container_1467829690678_0031_01_000003, state=COMPLETE,
exitStatus=-100, diagnostics=Container released by application
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
state.AppState - Container was queued for release :
container_1467829690678_0031_01_000003
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
state.AppState - decrementing role count for role ConnectD to 1;
releasing=0, completed=1
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
state.RoleHistory - Finished container for node 1, released=true,
shortlived=false
2016-07-07 15:50:14,987 [AMRM Callback Handler Thread] INFO
state.AppState - Removing node ID
container_1467829690678_0031_01_000003
2016-07-07 15:50:14,988 [AMRM Callback Handler Thread] INFO
agent.AgentProviderService - Removing component status for label
container_1467829690678_0031_01_000003___ConnectD

2016-07-07 15:50:14,988 [AmExecutor-006] INFO
appmaster.SliderAppMaster - Unregistering component
container_1467829690678_0031_01_000003
2016-07-07 15:50:14,989 [AmExecutor-006] INFO  state.AppState -
Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=1,
requested=0, releasing=0, failed=0, startFailed=0, started=2,
completed=1, totalRequested=2, preempted=0, nodeFailed=0,
failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048,
vCores:1>, isAntiAffinePlacement=false, failureMessage='',
providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1,
nodeFailureThreshold=3, placementTimeoutSeconds=30,
labelExpression='null'}} :
2016-07-07 15:50:23,715 [430377746@qtp-1726515679-4] INFO
agent.AgentProviderService - Sending terminate signal to completed
container (still heartbeating):
container_1467829690678_0031_01_000003___ConnectD
2016-07-07 16:12:43,891 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of slider-appmaster; was 0
2016-07-07 16:12:43,892 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of ConnectD; was 0
2016-07-07 16:12:43,892 [AmExecutor-006] INFO  state.RoleHistory -
Resetting failure history


slider-agent Logs:
---------------------
INFO 2016-07-07 15:50:23,720 ActionQueue.py:170 - Running command:
{u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName':
u'ConnectD', u'hostname': u'192.168.1.195', u'hostLevelParams':
{u'java_home': u'${JAVA_HOME}', u'container_id':
u'container_1467829690678_0031_01_000003'}, u'commandType':
u'EXECUTION_COMMAND', u'roleParams': {u'auto_restart': u'false'},
u'serviceName': u'kc', u'role': u'ConnectD', u'commandParams':
{u'record_config': u'true', u'service_package_folder':
u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script':
u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout':
u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId':
u'4-1', u'containers': [], u'configurations': {u'global':
{u'security_enabled': u'false', u'app_container_id':
u'container_1467829690678_0031_01_000003', u'listen_port': u'51670',
u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir':
u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir':
u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2',
u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid', u'app_install_dir':
u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk',
u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'},
u'server': {}}}
INFO 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:114 - Base
dir: 
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package

ERROR 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:169 -
Caught an exception while executing command: <class
'AgentException.AgentException'>: 'Script
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py
does not exist'
Traceback (most recent call last):
  File 
"/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
line 115, in runCommand
    script_path = self.resolve_script_path(self.base_dir, script, script_type)
  File 
"/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
line 199, in resolve_script_path
    raise AgentException(message)
AgentException: 'Script
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py
does not exist'
INFO 2016-07-07 15:50:23,721 ActionQueue.py:188 - Stop command received
INFO 2016-07-07 15:50:23,824 Controller.py:223 - Terminate agent
command received from AM, stopping the agent ...
INFO 2016-07-07 15:50:23,824 ProcessHelper.py:39 - Removing pid file
WARNING 2016-07-07 15:50:23,825 ProcessHelper.py:44 - Unable to remove
pid file: [Errno 2] No such file or directory:
'/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/infra/run/agent.pid'
INFO 2016-07-07 15:50:23,825 ProcessHelper.py:46 - Removing temp files
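
For reference, here is roughly what resolve_script_path appears to be
doing, reconstructed from the traceback above (the function name, call
site and error message come from the traceback; the body is my guess,
not the actual Slider source):

import os

class AgentException(Exception):
    pass

def resolve_script_path(base_dir, script, script_type):
    # base_dir is the resolved package directory; script is the
    # relative path from commandParams, e.g. 'scripts/kafka.py'.
    # script_type is accepted to match the call site but unused here.
    script_path = os.path.join(base_dir, script)
    if not os.path.isfile(script_path):
        message = "Script {0} does not exist".format(script_path)
        raise AgentException(message)
    return script_path

So once YARN deletes the container's appcache directory, this check is
guaranteed to fail for any base_dir rooted inside it.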

TL;DR: The callback signalling completion (and unregistration) of the
container arrives around 15:50:14, whereas the call to
CustomServiceOrchestrator (triggered by the terminate signal from the
Slider AM) is only received around 15:50:23. I guess I am missing how
Slider synchronizes the termination of a container with the cleanup of
all its resources (or whether YARN itself handles this).
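
Put differently, the guard I would have expected somewhere between the
release and the STOP execution looks like the sketch below
(safe_run_stop is purely hypothetical, my own naming; it only
illustrates the race and is not actual Slider code):

import os

def safe_run_stop(base_dir, run_command):
    # The AM queues the release at ~15:50:14, but the agent only
    # dequeues STOP at ~15:50:23; by then YARN may already have
    # deleted the container's appcache dir (hence the missing
    # kafka.py).
    if not os.path.isdir(base_dir):
        return  # container dir already cleaned up; nothing to stop
    run_command()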

- Sarthak

On Thu, Jul 7, 2016 at 2:48 PM, Billie Rinaldi <billie.rina...@gmail.com> wrote:
> If you look for the container ID in the nodemanager log on the host where
> the container was running, you should be able to see when the container
> stopped and was cleaned up. Looks like it even logs when it deletes the
> container directories.
>
> On Thu, Jul 7, 2016 at 2:04 PM, Sarthak Kukreti <skuk...@ncsu.edu> wrote:
>
>> kafka.py is still present in the filecache directory; it's just the
>> "container_1467829690678_0022_01_000003" directory that seems to be
>> deleted before the runCommand() call.
>>
>> - Sarthak
>>
>> On Thu, Jul 7, 2016 at 12:35 PM, Billie Rinaldi
>> <billie.rina...@gmail.com> wrote:
>> > I think that
>> >
>> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition
>> > is linked to
>> >
>> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip,
>> > so those are actually the same directory. I am not sure why it is saying
>> > kafka.py does not exist when stopping the container; it definitely should
>> > not clean up that directory while a container is still running. Can you
>> > verify that app/definition/package/scripts/kafka.py exists for one of the
>> > containers that is running?
>> >
>> > On Thu, Jul 7, 2016 at 11:50 AM, Sarthak Kukreti <skuk...@ncsu.edu>
>> wrote:
>> >
>> >> Hello!
>> >>
>> >> I am trying to use Slider to distribute an application over a YARN
>> >> cluster. While attempting to use "slider flex" to decrease the number
>> >> of containers allocated for the application (using the kafka
>> >> app-package as reference), I came across the following error:
>> >>
>> >> ERROR 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:169 -
>> >> Caught an exception while executing command: <class
>> >> 'AgentException.AgentException'>: 'Script
>> >>
>> >>
>> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py
>> >> does not exist'
>> >> Traceback (most recent call last):
>> >>   File
>> >>
>> "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
>> >> line 115, in runCommand
>> >>     script_path = self.resolve_script_path(self.base_dir, script,
>> >> script_type)
>> >>   File
>> >>
>> "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
>> >> line 199, in resolve_script_path
>> >>     raise AgentException(message)
>> >> AgentException: 'Script
>> >>
>> >>
>> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py
>> >> does not exist'
>> >>
>> >>
>> >> (It seems like the directory is cleaned up before the command runs.)
>> >>
>> >> Additionally, I tried adding debug prints in
>> >> CustomServiceOrchestrator to see which base directory is used to
>> >> invoke the script, and found that the base directories for the
>> >> STATUS and STOP commands differ:
>> >>
>> >> STATUS command:
>> >>
>> >> INFO 2016-07-07 10:56:31,323 AgentToggleLogger.py:40 - Adding
>> >> STATUS_COMMAND for service kc of cluster kc to the queue.
>> >> INFO 2016-07-07 10:56:31,327 CustomServiceOrchestrator.py:114 - Base
>> >> dir:
>> >>
>> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip/package
>> >>
>> >>
>> >> STOP command:
>> >>
>> >> INFO 2016-07-07 10:57:36,455 AgentToggleLogger.py:40 - Adding
>> >> EXECUTION_COMMAND for service kc of cluster kc to the queue.
>> >> INFO 2016-07-07 10:57:36,456 Controller.py:251 - Attempting to
>> >> gracefully stop the application ...
>> >> INFO 2016-07-07 10:57:36,458 ActionQueue.py:134 - Package received:
>> >> INFO 2016-07-07 10:57:36,458 ActionQueue.py:140 - Executing command
>> >> with id = 4-1 for role = Hello of cluster kc
>> >> INFO 2016-07-07 10:57:36,460 ActionQueue.py:170 - Running command:
>> >> {u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName':
>> >> u'Hello', u'hostname': u'192.168.1.195', u'hostLevelParams':
>> >> {u'java_home':
>> >> u'/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home/',
>> >> u'container_id': u'container_1467829690678_0022_01_000003'},
>> >> u'commandType': u'EXECUTION_COMMAND', u'roleParams': {u'auto_restart':
>> >> u'false'}, u'serviceName': u'kc', u'role': u'Hello', u'commandParams':
>> >> {u'record_config': u'true', u'service_package_folder':
>> >> u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script':
>> >> u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout':
>> >> u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId':
>> >> u'4-1', u'containers': [], u'configurations': {u'global':
>> >> {u'security_enabled': u'false', u'app_container_id':
>> >> u'container_1467829690678_0022_01_000003', u'listen_port': u'52508',
>> >> u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir':
>> >> u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir':
>> >> u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2',
>> >> u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid', u'app_install_dir':
>> >> u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk',
>> >> u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'},
>> >> u'server': {}}}
>> >> INFO 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:114 - Base
>> >> dir:
>> >>
>> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package
>> >>
>> >> For some reason, the STOP command attempts to pick up the script from
>> >> the container-specific location, whereas the STATUS command goes
>> >> through an entirely different path (though I am not sure if this is
>> >> the cause of the issue). Any further pointers for debugging this would
>> >> be really helpful.
>> >>
>> >> (For reference, platform: OS X, Python 2.7.11)
>> >>
>> >> Thank you
>> >> Sarthak
>> >>
>>
