-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63443/
-----------------------------------------------------------
Review request for Aurora, Bill Farner and Zameer Manji.
Bugs: AURORA-1955
https://issues.apache.org/jira/browse/AURORA-1955
Repository: aurora
Description
-------
This commit consists of two independent parts:
a) ensure we interrupt the main thread when another thread dies with an unhandled exception
b) ensure the main thread of the executor can be interrupted
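Both parts can be illustrated together. This is a minimal, hypothetical sketch, not the patch itself: it assumes Python 3.8+ `threading.excepthook` in place of the Python 2 `sys.excepthook` / `twitter.common.exceptions` machinery the executor uses, and the names `ExceptionTerminationHandler`, `wait_interruptibly` and `run_demo` are made up for the example. Part (a) is a hook that calls `_thread.interrupt_main()` before chaining to the former hook. Part (b) is a main-thread wait that polls with a timeout, because on Python 2 an `Event.wait()` without a timeout cannot be interrupted by `KeyboardInterrupt`.

```python
import _thread
import signal
import threading


class ExceptionTerminationHandler:
    """Hypothetical sketch (Python 3.8+): when a non-main thread dies with an
    unhandled exception, interrupt the main thread, then chain to the former hook."""

    def __init__(self):
        self._former_hook = None

    def install(self):
        self._former_hook = threading.excepthook
        threading.excepthook = self._teardown_handler

    def uninstall(self):
        if self._former_hook is not None:
            threading.excepthook = self._former_hook
            self._former_hook = None

    def _teardown_handler(self, args):
        former = self._former_hook
        # Interrupt first, so that a broken former hook cannot keep the
        # process alive; then let the former hook report the traceback.
        _thread.interrupt_main()
        if former is not None:
            former(args)


def wait_interruptibly(event, poll_interval=1.0):
    """Block until `event` is set, waking up every `poll_interval` seconds.

    Waking up periodically gives the interpreter a chance to raise a pending
    KeyboardInterrupt in the main thread; on Python 2, Event.wait() without a
    timeout could not be interrupted at all.
    """
    while not event.wait(poll_interval):
        pass


def _crashing_worker():
    raise RuntimeError("Woops!")  # stands in for the injected test exception


def run_demo(poll_interval=0.1, deadline=5.0):
    """Return True if a crashing worker interrupts the waiting main thread,
    False if the main thread instead waited until the safety deadline."""
    # Make sure SIGINT maps to KeyboardInterrupt so interrupt_main() has an effect.
    signal.signal(signal.SIGINT, signal.default_int_handler)
    handler = ExceptionTerminationHandler()
    handler.install()
    shutdown_done = threading.Event()  # never set by the worker: a "hung" shutdown
    timer = threading.Timer(deadline, shutdown_done.set)  # safety net only
    timer.start()
    try:
        threading.Thread(target=_crashing_worker, daemon=True).start()
        wait_interruptibly(shutdown_done, poll_interval)
        return False
    except KeyboardInterrupt:
        return True
    finally:
        timer.cancel()
        handler.uninstall()


if __name__ == "__main__":
    print("main thread interrupted:", run_demo())
```

Calling `interrupt_main()` before the former hook is the design choice that matters here: in the executor logs below, the former hook (Ubuntu's apport hook) itself fails with an ImportError, yet the main thread still exits.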
Diffs
-----
src/main/python/apache/aurora/executor/bin/thermos_executor_main.py a191cf9eec844035c0f6aa5aed3731a06024c0df
src/main/python/apache/aurora/tools/thermos.py de20c06cea5bbb45c7a6f5acfeee69289f8e6ad8
src/main/python/apache/aurora/tools/thermos_observer.py 0318f990ac003c0b8925b7eb7359431cdee34f05
src/main/python/apache/thermos/common/excepthook.py PRE-CREATION
src/main/python/apache/thermos/runner/thermos_runner.py 847f51ed2c0e003f1325aa903bd0f0b760acb365
Diff: https://reviews.apache.org/r/63443/diff/1/
Testing
-------
This bug is hard to reproduce and to test automatically. I therefore verified the
fix manually by injecting a raised exception shortly before the last statement
of the `AuroraExecutor._shutdown` method. Without this patch, this left hanging
executors on the host. With this patch, everything terminates as expected.
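For a scripted variant of this manual check, the fault could be injected with `unittest.mock` instead of editing the source. The sketch below is hypothetical: `FakeExecutor` and its `_stop_runner` / `_signal_terminated` methods only mimic the shape of `AuroraExecutor._shutdown` and are not its real contents.

```python
from unittest import mock


class FakeExecutor:
    """Hypothetical stand-in for AuroraExecutor; only the shape of _shutdown matters."""

    def __init__(self):
        self.steps = []

    def _stop_runner(self):
        self.steps.append("stop_runner")

    def _signal_terminated(self):
        self.steps.append("signal_terminated")

    def _shutdown(self):
        self._stop_runner()
        self._signal_terminated()  # the "last statement" the fault is injected at


def shutdown_with_injected_fault(executor):
    """Run _shutdown with a RuntimeError raised in place of its last statement.

    Returns the raised exception, or None if shutdown completed normally.
    """
    with mock.patch.object(executor, "_signal_terminated",
                           side_effect=RuntimeError("Woops!")):
        try:
            executor._shutdown()
        except RuntimeError as exc:
            return exc
    return None


if __name__ == "__main__":
    executor = FakeExecutor()
    print(repr(shutdown_with_injected_fault(executor)), executor.steps)
```

Running such a `_shutdown` on a background thread, rather than calling it directly, would reproduce the scenario in the logs: the thread dies with `RuntimeError: Woops!` and, with this patch, the main thread is interrupted.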
For details of the successful run, see the executor logs below. Note that the
`apport.fileutils` ImportError is caused by Ubuntu's modifications to its Python
installation and is not critical.
```
twitter.common.app debug: Initializing: apache.thermos.common.excepthook
(Exception termination handler.)
I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent
93259518-14f4-4956-a39c-aa615bff9a5e-S0
Writing log files to disk in
/var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d
ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon
139968452134656)>. Interrupting main thread.
Traceback (most recent call last):
File
"/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
line 126, in _excepting_run
self.__real_run(*args, **kw)
File "apache/aurora/executor/status_manager.py", line 62, in run
File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
RuntimeError: Woops!
Exception in thread Thread-7 [TID=25450]:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File
"/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
line 115, in identified
return instancemethod(self, *args, **kwargs)
File
"/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
line 130, in _excepting_run
sys.excepthook(*sys.exc_info())
File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
self._former_hook()(exc_type, value, trace)
File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in
apport_excepthook
from apport.fileutils import likely_packaged, get_recent_crashes
ImportError: No module named apport.fileutils
twitter.common.app debug: main exited with ^C
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for
apache.thermos.common.excepthook (Exception termination handler.)
twitter.common.app debug: Running exit function for twitter.common.log (Logging
subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug: Active thread: <_MainThread(MainThread, started
139968622749504)>
twitter.common.app debug: Active thread (daemon):
<TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c]
[TID=25449], started daemon 139967951009536)>
twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-13,
started daemon 139968485705472)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-9,
started daemon 139967934224128)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-12,
started daemon 139967942616832)>
twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-3,
started daemon 139968510883584)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-11,
started daemon 139967925831424)>
twitter.common.app debug: Exiting cleanly.
```
Corresponding agent logs, showing that Mesos is aware of the crash during
teardown (the executor exited with status 130):
```
I1031 15:59:54.692739 1956 slave.cpp:4769] Executor
'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of
framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
I1031 15:59:54.692834 1956 slave.cpp:4869] Cleaning up executor
'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of
framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at
executor(1)@192.168.33.7:48931
I1031 15:59:54.692996 1956 slave.cpp:4957] Cleaning up framework
7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
```
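The agent reports exit status 130. That matches the common convention of 128 plus the signal number, with SIGINT being 2, which fits the executor's "main exited with ^C". A minimal sketch of that mapping, assuming a main function that surfaces the interrupt as `KeyboardInterrupt`; `exit_status_for` and `interrupted_main` are hypothetical names, not the twitter.common.app API.

```python
import signal


def exit_status_for(main):
    """Run `main` and return the exit status the process would report.

    Assumes the shell convention of 128 + signal number for a process ended by
    a signal, applied here to an interpreter-level KeyboardInterrupt.
    """
    try:
        main()
    except KeyboardInterrupt:
        return 128 + int(signal.SIGINT)  # SIGINT is 2, so this is 130
    return 0


def interrupted_main():
    raise KeyboardInterrupt  # what interrupting the main thread turns into


if __name__ == "__main__":
    print(exit_status_for(interrupted_main))  # 130
    print(exit_status_for(lambda: None))  # 0
```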
Thanks,
Stephan Erb