[ https://issues.apache.org/jira/browse/AURORA-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693564#comment-15693564 ]
Stephan Erb commented on AURORA-1801:
-------------------------------------

Now on master

{code}
commit d56f8c64466a94a990db3308a3130d3fce0584af
Author: Stephan Erb <s...@apache.org>
Date:   Thu Nov 24 16:37:51 2016 +0100

Tear down the observer in case of on unhandled errors

I was not able to manually trigger the root cause of AURORA-1801 by altering
the Mesos filesystem layout. I have therefore adopted the general teardown idea.

Example output (using a hardcoded throw):

```
Bottle v0.11.6 server starting up (using CherryPyServer())...
Listening on http://192.168.33.7:1338/
Hit Ctrl-C to quit.

E1106 23:03:36.722500 8699 exceptional.py:41] Unhandled error in thread Thread-1 [TID=8705]. Tearing down.
Traceback (most recent call last):
  File "apache/thermos/common/exceptional.py", line 37, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/observer/task_observer.py", line 135, in run
    self._detector.refresh()
  File "apache/thermos/observer/detector.py", line 74, in refresh
    self._refresh_detectors()
  File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors
    new_paths = set(self._path_detector.get_paths())
  File "apache/aurora/executor/common/path_detector.py", line 35, in get_paths
    return list(set(path for path in iterate() if os.path.exists(path)))
  File "apache/aurora/executor/common/path_detector.py", line 35, in <genexpr>
    return list(set(path for path in iterate() if os.path.exists(path)))
  File "apache/aurora/executor/common/path_detector.py", line 34, in iterate
    raise RuntimeError("Fail on purpose...")
RuntimeError: Fail on purpose...
I1106 23:03:42.513900 8728 static_assets.py:34] detecting assets...
I1106 23:03:42.541809 8728 static_assets.py:38] detected asset: observer.js
I1106 23:03:42.542799 8728 static_assets.py:38] detected asset: bootstrap.css
I1106 23:03:42.543728 8728 static_assets.py:38] detected asset: jquery.pailer.js
I1106 23:03:42.544576 8728 static_assets.py:38] detected asset: jquery.js
I1106 23:03:42.548482 8728 static_assets.py:38] detected asset: favicon.ico
Bottle v0.11.6 server starting up (using CherryPyServer())...
Listening on http://192.168.33.7:1338/
Hit Ctrl-C to quit.
```

Bugs closed: AURORA-1801
Reviewed at https://reviews.apache.org/r/53519/

src/main/python/apache/aurora/tools/thermos_observer.py | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
{code}

> TaskObserver thread stops refreshing after filesystem race condition
> --------------------------------------------------------------------
>
>                 Key: AURORA-1801
>                 URL: https://issues.apache.org/jira/browse/AURORA-1801
>             Project: Aurora
>          Issue Type: Bug
>          Components: Observer
>            Reporter: Stephan Erb
>             Fix For: 0.17.0
>
>
> It seems that a race condition when accessing the Mesos filesystem layout can
> bubble up and terminate the {{TaskObserver}} thread responsible for
> refreshing the internal data structure of available tasks. Restarting the
> observer fixes the problem.
>
> Exception triggering the issue:
> {code}
> Traceback (most recent call last):
>   File "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.bce9e54ac7cded79a75603fb4e6bcef2c7d1e6bc/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/thermos/observer/task_observer.py", line 135, in run
>   File "apache/thermos/observer/detector.py", line 74, in refresh
>   File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors
>   File "apache/aurora/executor/common/path_detector.py", line 34, in get_paths
>   File "apache/aurora/executor/common/path_detector.py", line 34, in <genexpr>
>   File "apache/aurora/executor/common/path_detector.py", line 33, in iterate
>   File "/usr/lib/python2.7/posixpath.py", line 376, in realpath
>     resolved = _resolve_link(component)
>   File "/usr/lib/python2.7/posixpath.py", line 399, in _resolve_link
>     resolved = os.readlink(path)
> OSError: [Errno 2] No such file or directory: '/var/lib/mesos/slaves/0768bcb3-205d-4409-a726-3001ad3ef902-S10/frameworks/20151001-085346-58917130-5050-37976-0000/executors/thermos-role-env-myname-0-f9fe0318-d39f-49d3-bdf8-e954d5879b33/runs/latest'
> {code}
> Solution space:
> * terminate the observer process if the {{TaskObserver}} thread fails
> * prevent unknown exceptions from aborting the {{TaskObserver}} run loop
> * prevent the observed race condition in {{detector.py}} or
>   {{path_detector.py}}
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
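
For readers following the thread, the first option in the solution space (tear the whole process down when the observer thread hits an unhandled error, which is the general idea the commit adopts) might look roughly like the sketch below. This is not the code from thermos_observer.py or exceptional.py: the names TeardownThread and hard_exit are hypothetical, and the teardown action is made injectable only so the behaviour can be exercised without killing the interpreter.

```python
import logging
import os
import threading

log = logging.getLogger(__name__)


def hard_exit(exit_code=1):
    """Hypothetical default teardown: end the whole process immediately.

    os._exit skips atexit handlers and does not wait for other threads,
    so a supervisor can restart the process right away.
    """
    os._exit(exit_code)


class TeardownThread(threading.Thread):
    """Hypothetical wrapper: run `work`; on any unhandled exception, log the
    traceback and invoke `teardown` instead of letting the thread die silently.
    """

    def __init__(self, work, teardown=hard_exit):
        super(TeardownThread, self).__init__()
        self.daemon = True
        self._work = work
        self._teardown = teardown

    def run(self):
        try:
            self._work()
        except Exception:
            # log.exception records the message together with the traceback.
            log.exception("Unhandled error in thread %s. Tearing down.", self.name)
            self._teardown()
```

With the default teardown, a failing refresh loop would end the process so that whatever supervises the observer can restart it, instead of leaving a running process whose task list is no longer refreshed.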