alexius ludeman created AURORA-1362:
---------------------------------------

             Summary: thermos_executor stops responding to commands
                 Key: AURORA-1362
                 URL: https://issues.apache.org/jira/browse/AURORA-1362
             Project: Aurora
          Issue Type: Story
          Components: Executor
    Affects Versions: 0.7.0
            Reporter: alexius ludeman


If
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/sandbox.py
 raises any exception, thermos_executor continues to run but no longer
responds to any commands.

Based on a conversation with Maxim on #aurora, the correct action is likely to
catch the exception at
https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/aurora_executor.py#L122
 and exit appropriately.
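
A minimal sketch of that "catch and exit" idea follows. It is not the actual
aurora_executor.py code: start_task, create_sandbox, send_status and terminate
are hypothetical names used only for illustration, CreationError is a local
stand-in for the sandbox error, and the status strings are placeholders.

```python
class CreationError(Exception):
    """Stand-in for the sandbox error mentioned in this issue."""


def start_task(create_sandbox, send_status, terminate):
    """Run sandbox setup; on any failure, report it and shut down.

    All three arguments are hypothetical callables supplied by the caller:
      create_sandbox() -- may raise (e.g. CreationError)
      send_status(state, message) -- reports a task state upstream
      terminate() -- stops the executor process
    Returns True if setup succeeded, False otherwise.
    """
    try:
        create_sandbox()
    except Exception as e:
        # Without this handler the exception would go unhandled and the
        # executor would be left running but unresponsive.
        send_status("TASK_FAILED", "Sandbox initialization failed: %s" % e)
        terminate()
        return False
    send_status("TASK_RUNNING", "")
    return True


if __name__ == "__main__":
    def broken_sandbox():
        raise CreationError("chown failed")

    def report(state, message):
        print(state, message)

    def stop():
        print("executor exiting")

    start_task(broken_sandbox, report, stop)
```

The point of the sketch is only that a failure during sandbox setup ends in a
terminal status plus an exit, rather than a live executor stuck in STARTING.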

To reproduce, attempt to launch a task as a non-existent user on the slave, or
cause a chmod/chown failure, which will raise CreationError.  Once this occurs,
the Aurora UI never shows the task moving past state STARTING.  When
transient_task_state_timeout is reached, the task state moves to LOST.
thermos_executor is still running on the slave, and Mesos considers the
task still active and in state STARTING.  Unfortunately, GC is unable to
clean it up because it does not know about it.  At this point there is no way
to recover from this orphaned thermos_executor short of killing it by hand.
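
For illustration, the non-existent-user case can be sketched outside Aurora as
below. This assumes the failed user lookup is surfaced as CreationError, which
is an assumption rather than something taken from sandbox.py. The helper
resolve_sandbox_owner and the user name are hypothetical, and CreationError is
a local stand-in class. It requires a Unix system, since it uses the standard
pwd module.

```python
import pwd


class CreationError(Exception):
    """Stand-in for the sandbox error mentioned in this issue."""


def resolve_sandbox_owner(username):
    """Look up the passwd entry for username.

    pwd.getpwnam raises KeyError when the user does not exist; this
    hypothetical helper converts that into CreationError.
    """
    try:
        return pwd.getpwnam(username)
    except KeyError as e:
        raise CreationError("Unknown user %r: %s" % (username, e))


if __name__ == "__main__":
    try:
        # Hypothetical user name that is assumed not to exist on this host.
        resolve_sandbox_owner("no-such-user-aurora-1362")
    except CreationError as e:
        print("sandbox setup would fail here:", e)
```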

Sorry, the line numbers will not match due to local changes, but the stack
trace should be accurate.
https://gist.github.com/lexinator/ca95b249c7cb25575395



