-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63443/
-----------------------------------------------------------
Review request for Aurora, Bill Farner and Zameer Manji.


Bugs: AURORA-1955
    https://issues.apache.org/jira/browse/AURORA-1955


Repository: aurora


Description
-------

This commit consists of two independent parts:

a) ensure we interrupt the main thread when there are unhandled exceptions
b) ensure the main thread of the executor can be interrupted


Diffs
-----

  src/main/python/apache/aurora/executor/bin/thermos_executor_main.py a191cf9eec844035c0f6aa5aed3731a06024c0df 
  src/main/python/apache/aurora/tools/thermos.py de20c06cea5bbb45c7a6f5acfeee69289f8e6ad8 
  src/main/python/apache/aurora/tools/thermos_observer.py 0318f990ac003c0b8925b7eb7359431cdee34f05 
  src/main/python/apache/thermos/common/excepthook.py PRE-CREATION 
  src/main/python/apache/thermos/runner/thermos_runner.py 847f51ed2c0e003f1325aa903bd0f0b760acb365 

Diff: https://reviews.apache.org/r/63443/diff/1/


Testing
-------

This bug is pretty hard to reproduce and test. I therefore opted for manual verification and injected an exception shortly before the last statement of the `AuroraExecutor._shutdown` method. Without this patch, this resulted in hanging executors on the host. With this patch, everything is terminated as expected.

For details of the successful run, please see the executor logs below. Please note that the `apport.fileutils` import error is due to Ubuntu messing with its Python installation. This is not critical.

```
twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination handler.)
I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d
ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon 139968452134656)>. Interrupting main thread.
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 62, in run
  File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
RuntimeError: Woops!
Exception in thread Thread-7 [TID=25450]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
    return instancemethod(self, *args, **kwargs)
  File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 130, in _excepting_run
    sys.excepthook(*sys.exc_info())
  File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
    self._former_hook()(exc_type, value, trace)
  File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
ImportError: No module named apport.fileutils
twitter.common.app debug: main exited with ^C
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception termination handler.)
twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug: Active thread: <_MainThread(MainThread, started 139968622749504)>
twitter.common.app debug: Active thread (daemon): <TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c] [TID=25449], started daemon 139967951009536)>
twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-13, started daemon 139968485705472)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-9, started daemon 139967934224128)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-12, started daemon 139967942616832)>
twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-3, started daemon 139968510883584)>
twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-11, started daemon 139967925831424)>
twitter.common.app debug: Exiting cleanly.
```

Corresponding agent logs, indicating that Mesos knows about the crash on teardown:

```
I1031 15:59:54.692739  1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
I1031 15:59:54.692834  1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
I1031 15:59:54.692996  1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
```


Thanks,

Stephan Erb
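
P.S. For readers without the diff at hand, part (a) of the description can be illustrated roughly as follows. This is only a minimal sketch and not the code in `excepthook.py`: it assumes Python 3.8+ and the standard `threading.excepthook` rather than the `twitter.common` machinery seen in the logs, and the names `install_thread_excepthook` and `interrupt` are hypothetical.

```python
import logging
import threading
import _thread

log = logging.getLogger(__name__)


def install_thread_excepthook(interrupt=_thread.interrupt_main):
    """Install a threading.excepthook that interrupts the main thread.

    `interrupt` is a hypothetical injection point so the hook can be
    exercised without raising a real KeyboardInterrupt. The previous hook
    is returned so that callers can restore it.
    """
    former_hook = threading.excepthook

    def hook(args):
        # args carries exc_type, exc_value, exc_traceback and thread.
        log.error(
            "Unhandled error in %r. Interrupting main thread.",
            args.thread,
            exc_info=(args.exc_type, args.exc_value, args.exc_traceback),
        )
        try:
            # Chain to the previous hook so the usual traceback is still printed.
            former_hook(args)
        finally:
            # Interrupt even if the previous hook itself fails.
            interrupt()

    threading.excepthook = hook
    return former_hook
```

With the default `interrupt`, `_thread.interrupt_main()` simulates a SIGINT, so the main thread only sees the `KeyboardInterrupt` once it gets back to running Python code. That is presumably why part (b) matters: a main thread that blocks indefinitely would not notice the interrupt.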