[
https://issues.apache.org/jira/browse/AURORA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
alexius ludeman closed AURORA-1362.
-----------------------------------
Resolution: Invalid
This was caused by local changes, which made an assumption that led to a raise outside of create().
> thermos_executor stop responding to commands
> --------------------------------------------
>
> Key: AURORA-1362
> URL: https://issues.apache.org/jira/browse/AURORA-1362
> Project: Aurora
> Issue Type: Bug
> Components: Executor
> Affects Versions: 0.7.0
> Reporter: alexius ludeman
>
> If
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/sandbox.py
> raises any exception, thermos_executor continues to run but no longer
> responds to any commands. It is orphaned and continues to consume resources
> until manually killed.
> Based on a conversation with Maxim on #aurora, the correct action is likely
> to catch the exception at
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/aurora_executor.py#L122
> and exit appropriately.
> To reproduce, attempt to launch a task as a nonexistent user on the slave,
> or cause a chmod/chown failure, either of which will raise CreationError.
> Once this occurs, the task never gets past the STARTING state in the Aurora
> UI. When transient_task_state_timeout is reached, the task state moves to
> LOST. thermos_executor will still be running on the slave, and Mesos still
> considers the task active, in state STARTING. Unfortunately, GC will be
> unable to clean it up because it does not know about it. At this point there
> is no way to recover this orphaned thermos_executor short of killing it by
> hand.
> Sorry the line numbers will not match due to local changes, but the
> stacktrace should be accurate.
> https://gist.github.com/lexinator/ca95b249c7cb25575395
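For illustration, the "catch and exit appropriately" suggestion in the report could look roughly like the sketch below. This is not the Aurora code. CreationError is the exception named in the report, but it is redefined here as a stand-in, and BrokenSandbox, WorkingSandbox, start_task, send_update, terminate and the "FAILED"/"RUNNING" labels are all hypothetical names chosen for this example.

```python
class CreationError(Exception):
    """Stand-in for the sandbox CreationError named in the report."""


class BrokenSandbox:
    """Hypothetical sandbox whose create() always fails (e.g. chown error)."""

    def create(self):
        raise CreationError("chown failed for user 'nosuchuser'")


class WorkingSandbox:
    """Hypothetical sandbox whose create() succeeds."""

    def create(self):
        pass


def start_task(sandbox, send_update, terminate):
    """Create the sandbox; on failure report a terminal state and shut down.

    send_update(state, message) and terminate() are assumed callbacks that
    stand in for whatever the executor uses to report status and exit.
    """
    try:
        sandbox.create()
    except CreationError as e:
        # Report a terminal state instead of leaving the task in STARTING,
        # then stop the executor so it does not linger as an orphan.
        send_update("FAILED", "Failed to create sandbox: %s" % e)
        terminate()
        return False
    send_update("RUNNING", None)
    return True


if __name__ == "__main__":
    updates, stopped = [], []
    ok = start_task(
        BrokenSandbox(),
        lambda state, msg: updates.append((state, msg)),
        lambda: stopped.append(True),
    )
    print(ok, updates, stopped)
```

The point of the sketch is only that the failure path both reports a terminal state and triggers shutdown, so the process does not keep running after the scheduler has given up on the task.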