[ https://issues.apache.org/jira/browse/AURORA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

alexius ludeman closed AURORA-1362.
-----------------------------------
    Resolution: Invalid

Due to local changes, the code assumed it could raise outside of create().

> thermos_executor stops responding to commands
> ---------------------------------------------
>
>                 Key: AURORA-1362
>                 URL: https://issues.apache.org/jira/browse/AURORA-1362
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.7.0
>            Reporter: alexius ludeman
>
> If 
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/sandbox.py
>  raises an exception, thermos_executor continues to run but no longer 
> responds to any commands.  It is orphaned and continues to consume resources 
> until it is killed manually.
> Based on a conversation with Maxim on #aurora, the correct action is likely to 
> catch the exception at 
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/aurora_executor.py#L122
>  and exit appropriately.
> To reproduce, attempt to launch a task as a nonexistent user on the slave, or 
> cause a chmod/chown failure, which will raise CreationError.  Once this occurs, 
> the Aurora UI never shows the task advancing past STARTING.  When 
> transient_task_state_timeout is reached, the task state moves to LOST.  
> thermos_executor will still be running on the slave, and Mesos considers the 
> task still active in state STARTING.  Unfortunately, GC is unable to 
> clean it up because it does not know about it.  At this point nothing can 
> recover the orphaned thermos_executor short of killing it by hand.
> Sorry, the line numbers will not match due to local changes, but the 
> stack trace should be accurate.
> https://gist.github.com/lexinator/ca95b249c7cb25575395
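
For illustration only, the "catch and exit appropriately" handling suggested in the description might look roughly like the sketch below. It does not use the real Aurora or Mesos APIs: CreationError here is a local stand-in for the exception named in the report, and FailingSandbox, RecordingDriver, send_status and start_task are hypothetical names invented for this example.

```python
class CreationError(Exception):
    """Local stand-in for the sandbox creation failure named in the report."""


class FailingSandbox:
    """Hypothetical sandbox whose create() fails, e.g. for an unknown user."""

    def create(self):
        raise CreationError("chown failed: no such user")


class RecordingDriver:
    """Hypothetical stand-in for an executor driver; records what it is told."""

    def __init__(self):
        self.updates = []
        self.stopped = False

    def send_status(self, state, message):
        self.updates.append((state, message))

    def stop(self):
        self.stopped = True


def start_task(driver, sandbox):
    """Create the sandbox; on any failure report a terminal state and stop.

    Returns True if the sandbox was created, False otherwise.
    """
    try:
        sandbox.create()
    except Exception as e:
        # Broad on purpose: any exception escaping here would otherwise
        # leave the executor running but unresponsive.
        driver.send_status("FAILED", "Sandbox creation failed: %s" % e)
        driver.stop()
        return False
    return True


driver = RecordingDriver()
started = start_task(driver, FailingSandbox())
```

The point of the sketch is that a terminal status is reported and the driver is stopped before returning, so the process does not linger in STARTING after the failure.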



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
