[ 
https://issues.apache.org/jira/browse/AURORA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235548#comment-16235548
 ] 

Stephan Erb commented on AURORA-1955:
-------------------------------------

Patch has been committed to master.

> thermos should exit on irrecoverable errors to avoid zombies
> ------------------------------------------------------------
>
>                 Key: AURORA-1955
>                 URL: https://issues.apache.org/jira/browse/AURORA-1955
>             Project: Aurora
>          Issue Type: Bug
>          Components: Thermos
>            Reporter: Mohit Jaggi
>            Assignee: Stephan Erb
>            Priority: Major
>
> We found several zombie executors on a cluster. The Thermos logs indicate
> that the executor hit system limits (it could not start new threads),
> apparently while trying to shut down. The Mesos agent is unable to get the
> status of this container from the Docker daemon (docker inspect fails).
> Shouldn't Thermos exit in such a case?
> {code}
> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> Writing log files to disk in /mnt/mesos/sandbox
> I1023 19:04:32.261165     7 exec.cpp:162] Version: 1.2.0
> I1023 19:04:32.264870    42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
> Writing log files to disk in /mnt/mesos/sandbox
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/thermos/monitoring/resource.py", line 243, in run
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>     thread.start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> thread.error: can't start new thread
> ERROR] Failed to stop health checkers:
> ERROR] Traceback (most recent call last):
>   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>     propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>     return deadline(*args, daemon=True, propagate=True, **kw)
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>     AnonymousThread().start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> error: can't start new thread
>
> ERROR] Failed to stop runner:
> ERROR] Traceback (most recent call last):
>   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>     propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>     return deadline(*args, daemon=True, propagate=True, **kw)
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>     AnonymousThread().start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> error: can't start new thread
>
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/aurora/executor/status_manager.py", line 62, in run
>   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>     deferred.start()
>   File "/usr/lib/python2.7/threading.py", line 745, in start
>     _start_new_thread(self.__bootstrap, ())
> thread.error: can't start new thread
> {code}
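
For illustration only, and not the patch that was committed: a minimal sketch of the behaviour the issue asks for, where a failure to start a thread is treated as irrecoverable and the whole process is terminated instead of being left half shut down. The names start_or_die, EXIT_IRRECOVERABLE and exit_fn are hypothetical and do not come from Thermos or Aurora. The sketch assumes Python 3, where a failed thread start raises RuntimeError; the Python 2.7 log above shows thread.error instead.

```python
import os
import sys
import threading

# Hypothetical exit code for "cannot continue"; not taken from Thermos.
EXIT_IRRECOVERABLE = 1


def start_or_die(thread, exit_fn=os._exit):
    """Start `thread`; if it cannot be started, terminate the process.

    exit_fn defaults to os._exit, which ends the process immediately
    from any thread. sys.exit() would only raise SystemExit in the
    calling thread, so a background thread could not stop the process
    with it.
    """
    try:
        thread.start()
    except RuntimeError as e:
        # Python 3 reports "can't start new thread" as RuntimeError.
        # Note: starting the same Thread object twice also raises
        # RuntimeError; this sketch does not distinguish the two.
        sys.stderr.write("Irrecoverable error, exiting: %s\n" % e)
        sys.stderr.flush()
        exit_fn(EXIT_IRRECOVERABLE)
```

A caller would replace a bare thread.start() with start_or_die(thread). The exit function is injectable only so the behaviour can be exercised without killing the calling process.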



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
