Repository: aurora Updated Branches: refs/heads/master 40d9d4dbe -> cb86e8358
Adding Configurable Wait Period for Graceful Shutdowns We have some services that require more than the current 10 seconds given to gracefully shutdown (they need to close resources, finish requests, etc). We would like to be able to configure the amount of time we wait between each stage of the graceful shutdown sequence. See this [proposal](https://docs.google.com/document/d/1Sl-KWNyt1j0nIndinqfJsH3pkUY5IYXfGWyLHU2wacs/edit?usp=sharing) for a more in-depth analysis. Testing Done: Ran unit and integration tests. Created and killed jobs with varying wait_escalation_secs values on the Vagrant devcluster. Bugs closed: AURORA-1931 Reviewed at https://reviews.apache.org/r/59733/ Project: http://git-wip-us.apache.org/repos/asf/aurora/repo Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/cb86e835 Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/cb86e835 Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/cb86e835 Branch: refs/heads/master Commit: cb86e835818202f2665730ef5036976a3328075e Parents: 40d9d4d Author: Jordan Ly <[email protected]> Authored: Tue Jun 13 11:00:45 2017 -0700 Committer: Santhosh Kumar <[email protected]> Committed: Tue Jun 13 11:00:45 2017 -0700 ---------------------------------------------------------------------- RELEASE-NOTES.md | 18 ++++- docs/reference/configuration.md | 21 +++-- docs/reference/task-lifecycle.md | 6 +- .../python/apache/aurora/config/schema/base.py | 6 ++ .../apache/aurora/executor/aurora_executor.py | 9 ++- .../executor/bin/thermos_executor_main.py | 15 +++- .../apache/aurora/executor/http_lifecycle.py | 28 +++++-- .../apache/aurora/client/cli/test_inspect.py | 4 +- .../bin/test_thermos_executor_entry_point.py | 1 + .../aurora/executor/test_http_lifecycle.py | 11 ++- .../aurora/executor/test_thermos_executor.py | 41 +++++++--- .../aurora/executor/test_thermos_task_runner.py | 83 +++++++++++++++++--- 12 files changed, 195 insertions(+), 48 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/RELEASE-NOTES.md ---------------------------------------------------------------------- diff --git a/RELEASE-NOTES.md b/RELEASE-NOTES.md index d14c4ad..87283de 100644 --- a/RELEASE-NOTES.md +++ b/RELEASE-NOTES.md @@ -1,8 +1,24 @@ -0.18.0 +0.19.0 (unreleased) =================== ### New/updated: +- Added the ability to configure the executor's stop timeout, which is the maximum amount of time + the executor will wait during a graceful shutdown sequence before continuing the 'Forceful + Termination' process (see + [here](http://aurora.apache.org/documentation/latest/reference/task-lifecycle/) for details). +- Added the ability to configure the wait period after calling the graceful shutdown endpoint and + the shutdown endpoint using the `graceful_shutdown_wait_secs` and `shutdown_wait_secs` fields in + `HttpLifecycleConfig` respectively. Previously, the executor would only wait 5 seconds between + steps (adding up to a total of 10 seconds as there are 2 steps). The overall waiting period is + bounded by the executor's stop timeout, which can be configured using the executor's + `stop_timeout_in_secs` flag. + +0.18.0 +====== + +### New/updated: + - Update to Mesos 1.2.0. Please upgrade Aurora to 0.18 before upgrading Mesos to 1.2.0 if you rely on Mesos filesystem images. - Add message parameter to `killTasks` RPC. http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/docs/reference/configuration.md ---------------------------------------------------------------------- diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md index 0040de1..6a9a3ff 100644 --- a/docs/reference/configuration.md +++ b/docs/reference/configuration.md @@ -528,26 +528,31 @@ See [Docker Command Line Reference](https://docs.docker.com/reference/commandlin ### HttpLifecycleConfig Objects - param | type | description - ----- | :----: | ----------- - ```port``` | String | The named port to send POST commands (Default: health) - ```graceful_shutdown_endpoint``` | String | Endpoint to hit to indicate that a task should gracefully shutdown. (Default: /quitquitquit) - ```shutdown_endpoint``` | String | Endpoint to hit to give a task its final warning before being killed. (Default: /abortabortabort) +*Note: The combined `graceful_shutdown_wait_secs` and `shutdown_wait_secs` is implicitly upper bounded by the `--stop_timeout_in_secs` flag exposed by the executor (see options [here](https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py), default is 2 minutes). Therefore, if the user specifies values that add up to more than `--stop_timeout_in_secs`, the task will be killed earlier than the user anticipates (see the termination lifecycle [here](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting)). Furthermore, `stop_timeout_in_secs` itself is implicitly upper bounded by two scheduler options: `transient_task_state_timeout` and `preemption_slot_hold_time` (see reference [here](http://aurora.apache.org/documentation/latest/reference/scheduler-configuration/). If the `stop_timeout_in_secs` exceeds either of these scheduler options, tasks could be designated as LOST or tasks utilizing preemption could lose their desired slot respectively. Cluster operators should be aware of these timings should they change the defaults.* + + param | type | description + ----- | :----: | ----------- + ```port``` | String | The named port to send POST commands. (Default: health) + ```graceful_shutdown_endpoint``` | String | Endpoint to hit to indicate that a task should gracefully shutdown. (Default: /quitquitquit) + ```shutdown_endpoint``` | String | Endpoint to hit to give a task its final warning before being killed. (Default: /abortabortabort) + ```graceful_shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```graceful_shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5) + ```shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5) #### graceful_shutdown_endpoint If the Job is listening on the port as specified by the HttpLifecycleConfig (default: `health`), a HTTP POST request will be sent over localhost to this endpoint to request that the task gracefully shut itself down. This is a -courtesy call before the `shutdown_endpoint` is invoked a fixed amount of -time later. +courtesy call before the `shutdown_endpoint` is invoked +`graceful_shutdown_wait_secs` seconds later. #### shutdown_endpoint If the Job is listening on the port as specified by the HttpLifecycleConfig (default: `health`), a HTTP POST request will be sent over localhost to this endpoint to request as a final warning before being shut down. If the task -does not shut down on its own after this, it will be forcefully killed +does not shut down on its own after `shutdown_wait_secs` seconds, it will be +forcefully killed. Specifying Scheduling Constraints http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/docs/reference/task-lifecycle.md ---------------------------------------------------------------------- diff --git a/docs/reference/task-lifecycle.md b/docs/reference/task-lifecycle.md index cf1b679..8ec0077 100644 --- a/docs/reference/task-lifecycle.md +++ b/docs/reference/task-lifecycle.md @@ -81,8 +81,10 @@ In any case, the responsible executor on the agent follows an escalation sequence when killing a running task: 1. If a `HttpLifecycleConfig` is not present, skip to (4). - 2. Send a POST to the `graceful_shutdown_endpoint` and wait 5 seconds. - 3. Send a POST to the `shutdown_endpoint` and wait 5 seconds. + 2. Send a POST to the `graceful_shutdown_endpoint` and wait + `graceful_shutdown_wait_secs` seconds. + 3. Send a POST to the `shutdown_endpoint` and wait + `shutdown_wait_secs` seconds. 4. Send SIGTERM (`kill`) and wait at most `finalization_wait` seconds. 5. Send SIGKILL (`kill -9`). http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/config/schema/base.py ---------------------------------------------------------------------- diff --git a/src/main/python/apache/aurora/config/schema/base.py b/src/main/python/apache/aurora/config/schema/base.py index b2692a6..18ce826 100644 --- a/src/main/python/apache/aurora/config/schema/base.py +++ b/src/main/python/apache/aurora/config/schema/base.py @@ -74,6 +74,12 @@ class HttpLifecycleConfig(Struct): # Endpoint to hit to give a task it's final warning before being killed. shutdown_endpoint = Default(String, '/abortabortabort') + # How much time to wait in seconds after calling the graceful shutdown endpoint + graceful_shutdown_wait_secs = Default(Integer, 5) + + # How much time to wait in seconds after calling the shutdown endpoint + shutdown_wait_secs = Default(Integer, 5) + class LifecycleConfig(Struct): http = HttpLifecycleConfig http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/aurora_executor.py ---------------------------------------------------------------------- diff --git a/src/main/python/apache/aurora/executor/aurora_executor.py b/src/main/python/apache/aurora/executor/aurora_executor.py index 81461cb..aeb3a4d 100644 --- a/src/main/python/apache/aurora/executor/aurora_executor.py +++ b/src/main/python/apache/aurora/executor/aurora_executor.py @@ -39,7 +39,6 @@ class AuroraExecutor(ExecutorBase, Observable): PERSISTENCE_WAIT = Amount(5, Time.SECONDS) SANDBOX_INITIALIZATION_TIMEOUT = Amount(10, Time.MINUTES) START_TIMEOUT = Amount(2, Time.MINUTES) - STOP_TIMEOUT = Amount(2, Time.MINUTES) STOP_WAIT = Amount(5, Time.SECONDS) def __init__( @@ -50,7 +49,8 @@ class AuroraExecutor(ExecutorBase, Observable): status_providers=(), clock=time, no_sandbox_create_user=False, - sandbox_mount_point=None): + sandbox_mount_point=None, + stop_timeout_in_secs=120): ExecutorBase.__init__(self) if not isinstance(runner_provider, TaskRunnerProvider): @@ -67,6 +67,7 @@ class AuroraExecutor(ExecutorBase, Observable): self._sandbox_provider = sandbox_provider self._no_sandbox_create_user = no_sandbox_create_user self._sandbox_mount_point = sandbox_mount_point + self._stop_timeout = Amount(stop_timeout_in_secs, Time.SECONDS) self._kill_manager = KillManager() # Events that are exposed for interested entities self.runner_aborted = threading.Event() @@ -206,7 +207,7 @@ class AuroraExecutor(ExecutorBase, Observable): runner_status = self._runner.status try: - propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT) + propagate_deadline(self._chained_checker.stop, timeout=self._stop_timeout) except Timeout: log.error('Failed to stop all checkers within deadline.') except Exception: @@ -214,7 +215,7 @@ class AuroraExecutor(ExecutorBase, Observable): log.error(traceback.format_exc()) try: - propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT) + propagate_deadline(self._runner.stop, timeout=self._stop_timeout) except Timeout: log.error('Failed to stop runner within deadline.') except Exception: http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py ---------------------------------------------------------------------- diff --git a/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py b/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py index c6c0898..a191cf9 100644 --- a/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py +++ b/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py @@ -188,6 +188,15 @@ app.add_option( action='store_true', help="Preserve thermos runners' environment variables for the task being run.") +app.add_option( + '--stop_timeout_in_secs', + dest='stop_timeout_in_secs', + type=int, + default=120, + help='The maximum amount of time to wait (in seconds) when gracefully killing a task before ' + 'beginning forceful termination. Graceful and forceful termination is defined in ' + 'HttpLifecycleConfig (see Task Lifecycle documentation for more info on termination).') + # TODO(wickman) Consider just having the OSS version require pip installed # thermos_runner binaries on every machine and instead of embedding the pex @@ -254,7 +263,8 @@ def initialize(options): status_providers=status_providers, sandbox_provider=UserOverrideDirectorySandboxProvider(options.execute_as_user), no_sandbox_create_user=options.no_create_user, - sandbox_mount_point=options.sandbox_mount_point + sandbox_mount_point=options.sandbox_mount_point, + stop_timeout_in_secs=options.stop_timeout_in_secs ) else: thermos_runner_provider = DefaultThermosTaskRunnerProvider( @@ -273,7 +283,8 @@ def initialize(options): runner_provider=thermos_runner_provider, status_providers=status_providers, no_sandbox_create_user=options.no_create_user, - sandbox_mount_point=options.sandbox_mount_point + sandbox_mount_point=options.sandbox_mount_point, + stop_timeout_in_secs=options.stop_timeout_in_secs ) return thermos_executor http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/http_lifecycle.py ---------------------------------------------------------------------- diff --git a/src/main/python/apache/aurora/executor/http_lifecycle.py b/src/main/python/apache/aurora/executor/http_lifecycle.py index 9280bf2..01e4d9a 100644 --- a/src/main/python/apache/aurora/executor/http_lifecycle.py +++ b/src/main/python/apache/aurora/executor/http_lifecycle.py @@ -25,7 +25,8 @@ from .common.task_runner import TaskError, TaskRunner class HttpLifecycleManager(TaskRunner): """A wrapper around a TaskRunner that performs HTTP lifecycle management.""" - ESCALATION_WAIT = Amount(5, Time.SECONDS) + DEFAULT_ESCALATION_WAIT = Amount(5, Time.SECONDS) + WAIT_POLL_INTERVAL = Amount(1, Time.SECONDS) @classmethod def wrap(cls, runner, task_instance, portmap): @@ -36,6 +37,14 @@ class HttpLifecycleManager(TaskRunner): http_lifecycle = task_instance.lifecycle().http() http_lifecycle_port = http_lifecycle.port().get() + graceful_shutdown_wait_secs = ( + Amount(http_lifecycle.graceful_shutdown_wait_secs().get(), Time.SECONDS) + if http_lifecycle.has_graceful_shutdown_wait_secs() + else cls.DEFAULT_ESCALATION_WAIT) + shutdown_wait_secs = ( + Amount(http_lifecycle.shutdown_wait_secs().get(), Time.SECONDS) + if http_lifecycle.has_shutdown_wait_secs() + else cls.DEFAULT_ESCALATION_WAIT) if not portmap or http_lifecycle_port not in portmap: # If DefaultLifecycle is ever to disable task lifecycle by default, we should @@ -44,8 +53,8 @@ class HttpLifecycleManager(TaskRunner): return runner escalation_endpoints = [ - http_lifecycle.graceful_shutdown_endpoint().get(), - http_lifecycle.shutdown_endpoint().get() + (http_lifecycle.graceful_shutdown_endpoint().get(), graceful_shutdown_wait_secs), + (http_lifecycle.shutdown_endpoint().get(), shutdown_wait_secs) ] return cls(runner, portmap[http_lifecycle_port], escalation_endpoints) @@ -63,13 +72,20 @@ class HttpLifecycleManager(TaskRunner): def _terminate_http(self): http_signaler = HttpSignaler(self._lifecycle_port) - for endpoint in self._escalation_endpoints: + for endpoint, wait_time in self._escalation_endpoints: handled, _ = http_signaler(endpoint, use_post_method=True) + log.info('Killing task, calling %s and waiting %s, handled is %s' % ( + endpoint, str(wait_time), str(handled))) - if handled: - self._clock.sleep(self.ESCALATION_WAIT.as_(Time.SECONDS)) + waited = Amount(0, Time.SECONDS) + while handled: if self._runner.status is not None: return True + if waited >= wait_time: + break + + self._clock.sleep(self.WAIT_POLL_INTERVAL.as_(Time.SECONDS)) + waited += self.WAIT_POLL_INTERVAL # --- public interface def start(self, timeout=None): http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/client/cli/test_inspect.py ---------------------------------------------------------------------- diff --git a/src/test/python/apache/aurora/client/cli/test_inspect.py b/src/test/python/apache/aurora/client/cli/test_inspect.py index 4a23c59..ecefc18 100644 --- a/src/test/python/apache/aurora/client/cli/test_inspect.py +++ b/src/test/python/apache/aurora/client/cli/test_inspect.py @@ -142,7 +142,9 @@ Process 'process': "http": { "graceful_shutdown_endpoint": "/quitquitquit", "port": "health", - "shutdown_endpoint": "/abortabortabort"}}, + "shutdown_endpoint": "/abortabortabort", + "graceful_shutdown_wait_secs": 5, + "shutdown_wait_secs": 5}}, "priority": 0} mock_output = "\n".join(mock_stdout) http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py ---------------------------------------------------------------------- diff --git a/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py b/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py index 38deae6..5ad2999 100644 --- a/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py +++ b/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py @@ -34,6 +34,7 @@ class ThermosExecutorMainTest(unittest.TestCase): mock_options.execute_as_user = False mock_options.nosetuid = False mock_options.announcer_ensemble = None + mock_options.stop_timeout_in_secs = 1 with patch( 'apache.aurora.executor.bin.thermos_executor_main.dump_runner_pex', return_value=mock_dump_runner_pex): http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_http_lifecycle.py ---------------------------------------------------------------------- diff --git a/src/test/python/apache/aurora/executor/test_http_lifecycle.py b/src/test/python/apache/aurora/executor/test_http_lifecycle.py index a967e34..aeba776 100644 --- a/src/test/python/apache/aurora/executor/test_http_lifecycle.py +++ b/src/test/python/apache/aurora/executor/test_http_lifecycle.py @@ -15,6 +15,7 @@ from contextlib import contextmanager import mock +from twitter.common.quantity import Amount, Time from apache.aurora.config.schema.base import HttpLifecycleConfig, LifecycleConfig, MesosTaskInstance from apache.aurora.executor.http_lifecycle import HttpLifecycleManager @@ -53,20 +54,24 @@ def test_http_lifecycle_wrapper_with_lifecycle(): with make_mocks(mti, {'health': 31337}) as (runner_mock, runner_wrapper, wrapper_init): assert isinstance(runner_wrapper, HttpLifecycleManager) assert wrapper_init.mock_calls == [ - mock.call(runner_mock, 31337, ['/quitquitquit', '/abortabortabort']) + mock.call(runner_mock, 31337, [('/quitquitquit', Amount(5, Time.SECONDS)), + ('/abortabortabort', Amount(5, Time.SECONDS))]) ] - # Validate that we can override ports + # Validate that we can override ports, endpoints, wait times mti = MesosTaskInstance(lifecycle=LifecycleConfig(http=HttpLifecycleConfig( port='http', graceful_shutdown_endpoint='/frankfrankfrank', shutdown_endpoint='/bobbobbob', + graceful_shutdown_wait_secs=123, + shutdown_wait_secs=456 ))) portmap = {'http': 12345, 'admin': 54321} with make_mocks(mti, portmap) as (runner_mock, runner_wrapper, wrapper_init): assert isinstance(runner_wrapper, HttpLifecycleManager) assert wrapper_init.mock_calls == [ - mock.call(runner_mock, 12345, ['/frankfrankfrank', '/bobbobbob']) + mock.call(runner_mock, 12345, [('/frankfrankfrank', Amount(123, Time.SECONDS)), + ('/bobbobbob', Amount(456, Time.SECONDS))]) ] http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_thermos_executor.py ---------------------------------------------------------------------- diff --git a/src/test/python/apache/aurora/executor/test_thermos_executor.py b/src/test/python/apache/aurora/executor/test_thermos_executor.py index e628ccd..f6ae1be 100644 --- a/src/test/python/apache/aurora/executor/test_thermos_executor.py +++ b/src/test/python/apache/aurora/executor/test_thermos_executor.py @@ -42,7 +42,7 @@ from apache.aurora.config.schema.base import ( Resources, Task ) -from apache.aurora.executor.aurora_executor import AuroraExecutor +from apache.aurora.executor.aurora_executor import AuroraExecutor, propagate_deadline from apache.aurora.executor.common.executor_timeout import ExecutorTimeout from apache.aurora.executor.common.health_checker import HealthCheckerProvider from apache.aurora.executor.common.sandbox import DirectorySandbox, SandboxProvider @@ -218,7 +218,8 @@ def make_executor( fast_status=False, runner_class=ThermosTaskRunner, status_providers=[HealthCheckerProvider()], - assert_task_is_running=True): + assert_task_is_running=True, + stop_timeout_in_secs=120): status_manager_class = FastStatusManager if fast_status else StatusManager runner_provider = make_provider(checkpoint_root, runner_class) @@ -227,6 +228,7 @@ def make_executor( status_manager_class=status_manager_class, sandbox_provider=DefaultTestSandboxProvider(), status_providers=status_providers, + stop_timeout_in_secs=stop_timeout_in_secs ) ExecutorTimeout(te.launched, proxy_driver, timeout=Amount(100, Time.MILLISECONDS)).start() @@ -394,16 +396,35 @@ class TestThermosExecutor(object): def test_killTask(self): # noqa proxy_driver = ProxyDriver() - with temporary_dir() as checkpoint_root: - _, executor = make_executor(proxy_driver, checkpoint_root, SLEEP60_MTI) + class ProvidedThermosRunnerMatcher(object): + """Matcher that ensures a bound method 'stop' from 'ProvidedThermosTaskRunner' is called.""" + + def __eq__(self, other): + return (type(other.im_self).__name__ == 'ProvidedThermosTaskRunner' + and other.__name__ == 'stop') + + with contextlib.nested( + temporary_dir(), + mock.patch('apache.aurora.executor.aurora_executor.propagate_deadline', + wraps=propagate_deadline)) as (checkpoint_root, mock_propagate_deadline): + + _, executor = make_executor( + proxy_driver, + checkpoint_root, + SLEEP60_MTI, + stop_timeout_in_secs=123) # send two, expect at most one delivered executor.killTask(proxy_driver, mesos_pb2.TaskID(value='sleep60-001')) executor.killTask(proxy_driver, mesos_pb2.TaskID(value='sleep60-001')) executor.terminated.wait() - updates = proxy_driver.method_calls['sendStatusUpdate'] - assert len(updates) == 3 - assert updates[-1][0][0].state == mesos_pb2.TASK_KILLED + updates = proxy_driver.method_calls['sendStatusUpdate'] + + mock_propagate_deadline.assert_called_with( # Ensure 'stop' is called with custom timeout. + ProvidedThermosRunnerMatcher(), + timeout=Amount(123, Time.SECONDS)) + assert len(updates) == 3 + assert updates[-1][0][0].state == mesos_pb2.TASK_KILLED def test_shutdown(self): proxy_driver = ProxyDriver() @@ -618,12 +639,11 @@ class TestThermosExecutor(object): with temporary_dir() as tempdir: te = FastThermosExecutor( runner_provider=make_provider(tempdir, mesos_containerizer_path='/doesnotexist'), - sandbox_provider=FileSystemImageTestSandboxProvider()) + sandbox_provider=FileSystemImageTestSandboxProvider(), stop_timeout_in_secs=1) te.launchTask(proxy_driver, make_task(HELLO_WORLD_MTI)) te.SANDBOX_INITIALIZATION_TIMEOUT = Amount(1, Time.MILLISECONDS) te.START_TIMEOUT = Amount(10, Time.MILLISECONDS) - te.STOP_TIMEOUT = Amount(10, Time.MILLISECONDS) proxy_driver.wait_stopped() @@ -643,11 +663,10 @@ class TestThermosExecutor(object): te = FastThermosExecutor( runner_provider=make_provider(tempdir, mesos_containerizer_path=tempfile), - sandbox_provider=FileSystemImageTestSandboxProvider()) + sandbox_provider=FileSystemImageTestSandboxProvider(), stop_timeout_in_secs=1) te.SANDBOX_INITIALIZATION_TIMEOUT = Amount(1, Time.MILLISECONDS) te.START_TIMEOUT = Amount(10, Time.MILLISECONDS) - te.STOP_TIMEOUT = Amount(10, Time.MILLISECONDS) te.launchTask(proxy_driver, make_task(HELLO_WORLD_MTI)) http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_thermos_task_runner.py ---------------------------------------------------------------------- diff --git a/src/test/python/apache/aurora/executor/test_thermos_task_runner.py b/src/test/python/apache/aurora/executor/test_thermos_task_runner.py index 1b92667..7096557 100644 --- a/src/test/python/apache/aurora/executor/test_thermos_task_runner.py +++ b/src/test/python/apache/aurora/executor/test_thermos_task_runner.py @@ -32,6 +32,7 @@ from twitter.common.quantity import Amount, Time from apache.aurora.config.schema.base import MB, MesosTaskInstance, Process, Resources, Task from apache.aurora.executor.common.sandbox import DirectorySandbox +from apache.aurora.executor.common.status_checker import StatusResult from apache.aurora.executor.http_lifecycle import HttpLifecycleManager from apache.aurora.executor.thermos_task_runner import ThermosTaskRunner from apache.thermos.common.statuses import ( @@ -227,33 +228,95 @@ class TestThermosTaskRunnerIntegration(object): assert task_runner.status.status == mesos_pb2.TASK_KILLED @patch('apache.aurora.executor.http_lifecycle.HttpSignaler') - def test_integration_http_teardown(self, SignalerClass): + def test_integration_http_teardown_killed(self, SignalerClass): + """Ensure that the http teardown procedure closes correctly when abort kills the process.""" signaler = SignalerClass.return_value - signaler.side_effect = lambda path, use_post_method: (path != '/quitquitquit', None) + signaler.side_effect = lambda path, use_post_method: (path == '/abortabortabort', None) clock = Mock(wraps=time) + class TerminalStateStatusRunner(ThermosTaskRunner): + """ + Status is called each poll in the teardown procedure. We return kill after the 3rd poll + to mimic a task that exits early. We want to ensure the shutdown procedure doesn't wait + the full time if it doesn't need to. + """ + + TIMES_CALLED = 0 + + @property + def status(self): + if (self.TIMES_CALLED >= 3): + return StatusResult('Test task mock status', mesos_pb2.TASK_KILLED) + self.TIMES_CALLED += 1 + with self.yield_sleepy( - ThermosTaskRunner, + TerminalStateStatusRunner, portmap={'health': 3141}, clock=clock, - sleep=1000, + sleep=0, exit_code=0) as task_runner: - class ImmediateHttpLifecycleManager(HttpLifecycleManager): - ESCALATION_WAIT = Amount(1, Time.MICROSECONDS) + graceful_shutdown_wait = Amount(1, Time.SECONDS) + shutdown_wait = Amount(5, Time.SECONDS) + http_task_runner = HttpLifecycleManager( + task_runner, 3141, [('/quitquitquit', graceful_shutdown_wait), + ('/abortabortabort', shutdown_wait)], clock=clock) + http_task_runner.start() + task_runner.forked.wait() + http_task_runner.stop() + + http_teardown_poll_wait_call = call(HttpLifecycleManager.WAIT_POLL_INTERVAL.as_(Time.SECONDS)) + assert clock.sleep.mock_calls.count(http_teardown_poll_wait_call) == 3 # Killed before 5 + assert signaler.mock_calls == [ + call('/quitquitquit', use_post_method=True), + call('/abortabortabort', use_post_method=True)] + + @patch('apache.aurora.executor.http_lifecycle.HttpSignaler') + def test_integration_http_teardown_escalate(self, SignalerClass): + """Ensure that the http teardown process fully escalates when quit/abort both fail to kill.""" + signaler = SignalerClass.return_value + signaler.side_effect = lambda path, use_post_method: (True, None) + + clock = Mock(wraps=time) + + class KillCalledTaskRunner(ThermosTaskRunner): + def __init__(self, *args, **kwargs): + self._killed_called = False + ThermosTaskRunner.__init__(self, *args, **kwargs) + + def kill_called(self): + return self._killed_called + + def kill(self): + self._killed_called = True + + @property + def status(self): + return None + + with self.yield_sleepy( + KillCalledTaskRunner, + portmap={'health': 3141}, + clock=clock, + sleep=0, + exit_code=0) as task_runner: - http_task_runner = ImmediateHttpLifecycleManager( - task_runner, 3141, ['/quitquitquit', '/abortabortabort'], clock=clock) + graceful_shutdown_wait = Amount(1, Time.SECONDS) + shutdown_wait = Amount(5, Time.SECONDS) + http_task_runner = HttpLifecycleManager( + task_runner, 3141, [('/quitquitquit', graceful_shutdown_wait), + ('/abortabortabort', shutdown_wait)], clock=clock) http_task_runner.start() task_runner.forked.wait() http_task_runner.stop() - escalation_wait = call(ImmediateHttpLifecycleManager.ESCALATION_WAIT.as_(Time.SECONDS)) - assert clock.sleep.mock_calls.count(escalation_wait) == 1 + http_teardown_poll_wait_call = call(HttpLifecycleManager.WAIT_POLL_INTERVAL.as_(Time.SECONDS)) + assert clock.sleep.mock_calls.count(http_teardown_poll_wait_call) == 6 assert signaler.mock_calls == [ call('/quitquitquit', use_post_method=True), call('/abortabortabort', use_post_method=True)] + assert task_runner.kill_called() == True def test_thermos_normal_exit_status(self): with self.exit_with_status(0, TaskState.SUCCESS) as task_runner:
