aurora git commit: Adding Configurable Wait Period for Graceful Shutdowns

santhk Tue, 13 Jun 2017 11:02:05 -0700

Repository: aurora
Updated Branches:
  refs/heads/master 40d9d4dbe -> cb86e8358



Adding Configurable Wait Period for Graceful Shutdowns

We have some services that require more than the current 10 seconds given to
gracefully shutdown (they need to close resources, finish requests, etc).

We would like to be able to configure the amount of time we wait between each
stage of the graceful shutdown sequence. See this 
[proposal](https://docs.google.com/document/d/1Sl-KWNyt1j0nIndinqfJsH3pkUY5IYXfGWyLHU2wacs/edit?usp=sharing)
 for a more in-depth
analysis.

Testing Done:
Ran unit and integration tests.

Created and killed jobs with varying wait_escalation_secs values on the Vagrant 
devcluster.

Bugs closed: AURORA-1931

Reviewed at https://reviews.apache.org/r/59733/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/cb86e835
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/cb86e835
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/cb86e835

Branch: refs/heads/master
Commit: cb86e835818202f2665730ef5036976a3328075e
Parents: 40d9d4d
Author: Jordan Ly <[email protected]>
Authored: Tue Jun 13 11:00:45 2017 -0700
Committer: Santhosh Kumar <[email protected]>
Committed: Tue Jun 13 11:00:45 2017 -0700

----------------------------------------------------------------------
 RELEASE-NOTES.md                                | 18 ++++-
 docs/reference/configuration.md                 | 21 +++--
 docs/reference/task-lifecycle.md                |  6 +-
 .../python/apache/aurora/config/schema/base.py  |  6 ++
 .../apache/aurora/executor/aurora_executor.py   |  9 ++-
 .../executor/bin/thermos_executor_main.py       | 15 +++-
 .../apache/aurora/executor/http_lifecycle.py    | 28 +++++--
 .../apache/aurora/client/cli/test_inspect.py    |  4 +-
 .../bin/test_thermos_executor_entry_point.py    |  1 +
 .../aurora/executor/test_http_lifecycle.py      | 11 ++-
 .../aurora/executor/test_thermos_executor.py    | 41 +++++++---
 .../aurora/executor/test_thermos_task_runner.py | 83 +++++++++++++++++---
 12 files changed, 195 insertions(+), 48 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/RELEASE-NOTES.md
----------------------------------------------------------------------
diff --git a/RELEASE-NOTES.md b/RELEASE-NOTES.md
index d14c4ad..87283de 100644
--- a/RELEASE-NOTES.md
+++ b/RELEASE-NOTES.md
@@ -1,8 +1,24 @@
-0.18.0
+0.19.0 (unreleased)
 ===================
 
 ### New/updated:
 
+- Added the ability to configure the executor's stop timeout, which is the 
maximum amount of time
+  the executor will wait during a graceful shutdown sequence before continuing 
the 'Forceful
+  Termination' process (see
+  
[here](http://aurora.apache.org/documentation/latest/reference/task-lifecycle/) 
for details).
+- Added the ability to configure the wait period after calling the graceful 
shutdown endpoint and
+  the shutdown endpoint using the `graceful_shutdown_wait_secs` and 
`shutdown_wait_secs` fields in
+  `HttpLifecycleConfig` respectively. Previously, the executor would only wait 
5 seconds between
+  steps (adding up to a total of 10 seconds as there are 2 steps). The overall 
waiting period is
+  bounded by the executor's stop timeout, which can be configured using the 
executor's
+  `stop_timeout_in_secs` flag.
+
+0.18.0
+======
+
+### New/updated:
+
 - Update to Mesos 1.2.0. Please upgrade Aurora to 0.18 before upgrading Mesos 
to 1.2.0 if you rely
   on Mesos filesystem images.
 - Add message parameter to `killTasks` RPC.

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/docs/reference/configuration.md
----------------------------------------------------------------------
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
index 0040de1..6a9a3ff 100644
--- a/docs/reference/configuration.md
+++ b/docs/reference/configuration.md
@@ -528,26 +528,31 @@ See [Docker Command Line 
Reference](https://docs.docker.com/reference/commandlin
 
 ### HttpLifecycleConfig Objects
 
-  param          | type            | description
-  -----          | :----:          | -----------
-  ```port```     | String          | The named port to send POST commands 
(Default: health)
-  ```graceful_shutdown_endpoint``` | String | Endpoint to hit to indicate that 
a task should gracefully shutdown. (Default: /quitquitquit)
-  ```shutdown_endpoint``` | String | Endpoint to hit to give a task its final 
warning before being killed. (Default: /abortabortabort)
+*Note: The combined `graceful_shutdown_wait_secs` and `shutdown_wait_secs` is 
implicitly upper bounded by the `--stop_timeout_in_secs` flag exposed by the 
executor (see options 
[here](https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py),
 default is 2 minutes). Therefore, if the user specifies values that add up to 
more than `--stop_timeout_in_secs`, the task will be killed earlier than the 
user anticipates (see the termination lifecycle 
[here](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting)).
 Furthermore, `stop_timeout_in_secs` itself is implicitly upper bounded by two 
scheduler options: `transient_task_state_timeout` and 
`preemption_slot_hold_time` (see reference 
[here](http://aurora.apache.org/documentation/latest/reference/scheduler-configuration/).
 If the `stop_timeout_in_secs` exceeds either of these scheduler options, tasks 
could be designated as LOST 
 or tasks utilizing preemption could lose their desired slot respectively. 
Cluster operators should be aware of these timings should they change the 
defaults.*
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```port```                        | String  | The named port to send POST 
commands. (Default: health)
+  ```graceful_shutdown_endpoint```  | String  | Endpoint to hit to indicate 
that a task should gracefully shutdown. (Default: /quitquitquit)
+  ```shutdown_endpoint```           | String  | Endpoint to hit to give a task 
its final warning before being killed. (Default: /abortabortabort)
+  ```graceful_shutdown_wait_secs``` | Integer | The amount of time (in 
seconds) to wait after hitting the ```graceful_shutdown_endpoint``` before 
proceeding with the [task termination 
lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting).
 (Default: 5)
+  ```shutdown_wait_secs```          | Integer | The amount of time (in 
seconds) to wait after hitting the ```shutdown_endpoint``` before proceeding 
with the [task termination 
lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting).
 (Default: 5)
 
 #### graceful_shutdown_endpoint
 
 If the Job is listening on the port as specified by the HttpLifecycleConfig
 (default: `health`), a HTTP POST request will be sent over localhost to this
 endpoint to request that the task gracefully shut itself down.  This is a
-courtesy call before the `shutdown_endpoint` is invoked a fixed amount of
-time later.
+courtesy call before the `shutdown_endpoint` is invoked
+`graceful_shutdown_wait_secs` seconds later.
 
 #### shutdown_endpoint
 
 If the Job is listening on the port as specified by the HttpLifecycleConfig
 (default: `health`), a HTTP POST request will be sent over localhost to this
 endpoint to request as a final warning before being shut down.  If the task
-does not shut down on its own after this, it will be forcefully killed
+does not shut down on its own after `shutdown_wait_secs` seconds, it will be
+forcefully killed.
 
 
 Specifying Scheduling Constraints

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/docs/reference/task-lifecycle.md
----------------------------------------------------------------------
diff --git a/docs/reference/task-lifecycle.md b/docs/reference/task-lifecycle.md
index cf1b679..8ec0077 100644
--- a/docs/reference/task-lifecycle.md
+++ b/docs/reference/task-lifecycle.md
@@ -81,8 +81,10 @@ In any case, the responsible executor on the agent follows 
an escalation
 sequence when killing a running task:
 
   1. If a `HttpLifecycleConfig` is not present, skip to (4).
-  2. Send a POST to the `graceful_shutdown_endpoint` and wait 5 seconds.
-  3. Send a POST to the `shutdown_endpoint` and wait 5 seconds.
+  2. Send a POST to the `graceful_shutdown_endpoint` and wait
+  `graceful_shutdown_wait_secs` seconds.
+  3. Send a POST to the `shutdown_endpoint` and wait
+  `shutdown_wait_secs` seconds.
   4. Send SIGTERM (`kill`) and wait at most `finalization_wait` seconds.
   5. Send SIGKILL (`kill -9`).
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/config/schema/base.py
----------------------------------------------------------------------
diff --git a/src/main/python/apache/aurora/config/schema/base.py 
b/src/main/python/apache/aurora/config/schema/base.py
index b2692a6..18ce826 100644
--- a/src/main/python/apache/aurora/config/schema/base.py
+++ b/src/main/python/apache/aurora/config/schema/base.py
@@ -74,6 +74,12 @@ class HttpLifecycleConfig(Struct):
   # Endpoint to hit to give a task it's final warning before being killed.
   shutdown_endpoint = Default(String, '/abortabortabort')
 
+  # How much time to wait in seconds after calling the graceful shutdown 
endpoint
+  graceful_shutdown_wait_secs = Default(Integer, 5)
+
+  # How much time to wait in seconds after calling the shutdown endpoint
+  shutdown_wait_secs = Default(Integer, 5)
+
 
 class LifecycleConfig(Struct):
   http = HttpLifecycleConfig

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/aurora_executor.py
----------------------------------------------------------------------
diff --git a/src/main/python/apache/aurora/executor/aurora_executor.py 
b/src/main/python/apache/aurora/executor/aurora_executor.py
index 81461cb..aeb3a4d 100644
--- a/src/main/python/apache/aurora/executor/aurora_executor.py
+++ b/src/main/python/apache/aurora/executor/aurora_executor.py
@@ -39,7 +39,6 @@ class AuroraExecutor(ExecutorBase, Observable):
   PERSISTENCE_WAIT = Amount(5, Time.SECONDS)
   SANDBOX_INITIALIZATION_TIMEOUT = Amount(10, Time.MINUTES)
   START_TIMEOUT = Amount(2, Time.MINUTES)
-  STOP_TIMEOUT = Amount(2, Time.MINUTES)
   STOP_WAIT = Amount(5, Time.SECONDS)
 
   def __init__(
@@ -50,7 +49,8 @@ class AuroraExecutor(ExecutorBase, Observable):
       status_providers=(),
       clock=time,
       no_sandbox_create_user=False,
-      sandbox_mount_point=None):
+      sandbox_mount_point=None,
+      stop_timeout_in_secs=120):
 
     ExecutorBase.__init__(self)
     if not isinstance(runner_provider, TaskRunnerProvider):
@@ -67,6 +67,7 @@ class AuroraExecutor(ExecutorBase, Observable):
     self._sandbox_provider = sandbox_provider
     self._no_sandbox_create_user = no_sandbox_create_user
     self._sandbox_mount_point = sandbox_mount_point
+    self._stop_timeout = Amount(stop_timeout_in_secs, Time.SECONDS)
     self._kill_manager = KillManager()
     # Events that are exposed for interested entities
     self.runner_aborted = threading.Event()
@@ -206,7 +207,7 @@ class AuroraExecutor(ExecutorBase, Observable):
     runner_status = self._runner.status
 
     try:
-      propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
+      propagate_deadline(self._chained_checker.stop, 
timeout=self._stop_timeout)
     except Timeout:
       log.error('Failed to stop all checkers within deadline.')
     except Exception:
@@ -214,7 +215,7 @@ class AuroraExecutor(ExecutorBase, Observable):
       log.error(traceback.format_exc())
 
     try:
-      propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
+      propagate_deadline(self._runner.stop, timeout=self._stop_timeout)
     except Timeout:
       log.error('Failed to stop runner within deadline.')
     except Exception:

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
----------------------------------------------------------------------
diff --git 
a/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py 
b/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
index c6c0898..a191cf9 100644
--- a/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
+++ b/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
@@ -188,6 +188,15 @@ app.add_option(
      action='store_true',
      help="Preserve thermos runners' environment variables for the task being 
run.")
 
+app.add_option(
+    '--stop_timeout_in_secs',
+    dest='stop_timeout_in_secs',
+    type=int,
+    default=120,
+    help='The maximum amount of time to wait (in seconds) when gracefully 
killing a task before '
+         'beginning forceful termination. Graceful and forceful termination is 
defined in '
+         'HttpLifecycleConfig (see Task Lifecycle documentation for more info 
on termination).')
+
 
 # TODO(wickman) Consider just having the OSS version require pip installed
 # thermos_runner binaries on every machine and instead of embedding the pex
@@ -254,7 +263,8 @@ def initialize(options):
       status_providers=status_providers,
       
sandbox_provider=UserOverrideDirectorySandboxProvider(options.execute_as_user),
       no_sandbox_create_user=options.no_create_user,
-      sandbox_mount_point=options.sandbox_mount_point
+      sandbox_mount_point=options.sandbox_mount_point,
+      stop_timeout_in_secs=options.stop_timeout_in_secs
     )
   else:
     thermos_runner_provider = DefaultThermosTaskRunnerProvider(
@@ -273,7 +283,8 @@ def initialize(options):
       runner_provider=thermos_runner_provider,
       status_providers=status_providers,
       no_sandbox_create_user=options.no_create_user,
-      sandbox_mount_point=options.sandbox_mount_point
+      sandbox_mount_point=options.sandbox_mount_point,
+      stop_timeout_in_secs=options.stop_timeout_in_secs
     )
 
   return thermos_executor

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/http_lifecycle.py
----------------------------------------------------------------------
diff --git a/src/main/python/apache/aurora/executor/http_lifecycle.py 
b/src/main/python/apache/aurora/executor/http_lifecycle.py
index 9280bf2..01e4d9a 100644
--- a/src/main/python/apache/aurora/executor/http_lifecycle.py
+++ b/src/main/python/apache/aurora/executor/http_lifecycle.py
@@ -25,7 +25,8 @@ from .common.task_runner import TaskError, TaskRunner
 class HttpLifecycleManager(TaskRunner):
   """A wrapper around a TaskRunner that performs HTTP lifecycle management."""
 
-  ESCALATION_WAIT = Amount(5, Time.SECONDS)
+  DEFAULT_ESCALATION_WAIT = Amount(5, Time.SECONDS)
+  WAIT_POLL_INTERVAL = Amount(1, Time.SECONDS)
 
   @classmethod
   def wrap(cls, runner, task_instance, portmap):
@@ -36,6 +37,14 @@ class HttpLifecycleManager(TaskRunner):
 
     http_lifecycle = task_instance.lifecycle().http()
     http_lifecycle_port = http_lifecycle.port().get()
+    graceful_shutdown_wait_secs = (
+        Amount(http_lifecycle.graceful_shutdown_wait_secs().get(), 
Time.SECONDS)
+        if http_lifecycle.has_graceful_shutdown_wait_secs()
+        else cls.DEFAULT_ESCALATION_WAIT)
+    shutdown_wait_secs = (
+        Amount(http_lifecycle.shutdown_wait_secs().get(), Time.SECONDS)
+        if http_lifecycle.has_shutdown_wait_secs()
+        else cls.DEFAULT_ESCALATION_WAIT)
 
     if not portmap or http_lifecycle_port not in portmap:
       # If DefaultLifecycle is ever to disable task lifecycle by default, we 
should
@@ -44,8 +53,8 @@ class HttpLifecycleManager(TaskRunner):
       return runner
 
     escalation_endpoints = [
-        http_lifecycle.graceful_shutdown_endpoint().get(),
-        http_lifecycle.shutdown_endpoint().get()
+        (http_lifecycle.graceful_shutdown_endpoint().get(), 
graceful_shutdown_wait_secs),
+        (http_lifecycle.shutdown_endpoint().get(), shutdown_wait_secs)
     ]
     return cls(runner, portmap[http_lifecycle_port], escalation_endpoints)
 
@@ -63,13 +72,20 @@ class HttpLifecycleManager(TaskRunner):
   def _terminate_http(self):
     http_signaler = HttpSignaler(self._lifecycle_port)
 
-    for endpoint in self._escalation_endpoints:
+    for endpoint, wait_time in self._escalation_endpoints:
       handled, _ = http_signaler(endpoint, use_post_method=True)
+      log.info('Killing task, calling %s and waiting %s, handled is %s' % (
+          endpoint, str(wait_time), str(handled)))
 
-      if handled:
-        self._clock.sleep(self.ESCALATION_WAIT.as_(Time.SECONDS))
+      waited = Amount(0, Time.SECONDS)
+      while handled:
         if self._runner.status is not None:
           return True
+        if waited >= wait_time:
+          break
+
+        self._clock.sleep(self.WAIT_POLL_INTERVAL.as_(Time.SECONDS))
+        waited += self.WAIT_POLL_INTERVAL
 
   # --- public interface
   def start(self, timeout=None):

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/client/cli/test_inspect.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/client/cli/test_inspect.py 
b/src/test/python/apache/aurora/client/cli/test_inspect.py
index 4a23c59..ecefc18 100644
--- a/src/test/python/apache/aurora/client/cli/test_inspect.py
+++ b/src/test/python/apache/aurora/client/cli/test_inspect.py
@@ -142,7 +142,9 @@ Process 'process':
             "http": {
                 "graceful_shutdown_endpoint": "/quitquitquit",
                 "port": "health",
-                "shutdown_endpoint": "/abortabortabort"}},
+                "shutdown_endpoint": "/abortabortabort",
+                "graceful_shutdown_wait_secs": 5,
+                "shutdown_wait_secs": 5}},
         "priority": 0}
 
     mock_output = "\n".join(mock_stdout)

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
----------------------------------------------------------------------
diff --git 
a/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
 
b/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
index 38deae6..5ad2999 100644
--- 
a/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
+++ 
b/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
@@ -34,6 +34,7 @@ class ThermosExecutorMainTest(unittest.TestCase):
     mock_options.execute_as_user = False
     mock_options.nosetuid = False
     mock_options.announcer_ensemble = None
+    mock_options.stop_timeout_in_secs = 1
     with patch(
         'apache.aurora.executor.bin.thermos_executor_main.dump_runner_pex',
         return_value=mock_dump_runner_pex):

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_http_lifecycle.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/executor/test_http_lifecycle.py 
b/src/test/python/apache/aurora/executor/test_http_lifecycle.py
index a967e34..aeba776 100644
--- a/src/test/python/apache/aurora/executor/test_http_lifecycle.py
+++ b/src/test/python/apache/aurora/executor/test_http_lifecycle.py
@@ -15,6 +15,7 @@
 from contextlib import contextmanager
 
 import mock
+from twitter.common.quantity import Amount, Time
 
 from apache.aurora.config.schema.base import HttpLifecycleConfig, 
LifecycleConfig, MesosTaskInstance
 from apache.aurora.executor.http_lifecycle import HttpLifecycleManager
@@ -53,20 +54,24 @@ def test_http_lifecycle_wrapper_with_lifecycle():
   with make_mocks(mti, {'health': 31337}) as (runner_mock, runner_wrapper, 
wrapper_init):
     assert isinstance(runner_wrapper, HttpLifecycleManager)
     assert wrapper_init.mock_calls == [
-      mock.call(runner_mock, 31337, ['/quitquitquit', '/abortabortabort'])
+      mock.call(runner_mock, 31337, [('/quitquitquit', Amount(5, 
Time.SECONDS)),
+        ('/abortabortabort', Amount(5, Time.SECONDS))])
     ]
 
-  # Validate that we can override ports
+  # Validate that we can override ports, endpoints, wait times
   mti = MesosTaskInstance(lifecycle=LifecycleConfig(http=HttpLifecycleConfig(
       port='http',
       graceful_shutdown_endpoint='/frankfrankfrank',
       shutdown_endpoint='/bobbobbob',
+      graceful_shutdown_wait_secs=123,
+      shutdown_wait_secs=456
   )))
   portmap = {'http': 12345, 'admin': 54321}
   with make_mocks(mti, portmap) as (runner_mock, runner_wrapper, wrapper_init):
     assert isinstance(runner_wrapper, HttpLifecycleManager)
     assert wrapper_init.mock_calls == [
-      mock.call(runner_mock, 12345, ['/frankfrankfrank', '/bobbobbob'])
+      mock.call(runner_mock, 12345, [('/frankfrankfrank', Amount(123, 
Time.SECONDS)),
+        ('/bobbobbob', Amount(456, Time.SECONDS))])
     ]
 
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_thermos_executor.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/executor/test_thermos_executor.py 
b/src/test/python/apache/aurora/executor/test_thermos_executor.py
index e628ccd..f6ae1be 100644
--- a/src/test/python/apache/aurora/executor/test_thermos_executor.py
+++ b/src/test/python/apache/aurora/executor/test_thermos_executor.py
@@ -42,7 +42,7 @@ from apache.aurora.config.schema.base import (
     Resources,
     Task
 )
-from apache.aurora.executor.aurora_executor import AuroraExecutor
+from apache.aurora.executor.aurora_executor import AuroraExecutor, 
propagate_deadline
 from apache.aurora.executor.common.executor_timeout import ExecutorTimeout
 from apache.aurora.executor.common.health_checker import HealthCheckerProvider
 from apache.aurora.executor.common.sandbox import DirectorySandbox, 
SandboxProvider
@@ -218,7 +218,8 @@ def make_executor(
     fast_status=False,
     runner_class=ThermosTaskRunner,
     status_providers=[HealthCheckerProvider()],
-    assert_task_is_running=True):
+    assert_task_is_running=True,
+    stop_timeout_in_secs=120):
 
   status_manager_class = FastStatusManager if fast_status else StatusManager
   runner_provider = make_provider(checkpoint_root, runner_class)
@@ -227,6 +228,7 @@ def make_executor(
       status_manager_class=status_manager_class,
       sandbox_provider=DefaultTestSandboxProvider(),
       status_providers=status_providers,
+      stop_timeout_in_secs=stop_timeout_in_secs
   )
 
   ExecutorTimeout(te.launched, proxy_driver, timeout=Amount(100, 
Time.MILLISECONDS)).start()
@@ -394,16 +396,35 @@ class TestThermosExecutor(object):
   def test_killTask(self):  # noqa
     proxy_driver = ProxyDriver()
 
-    with temporary_dir() as checkpoint_root:
-      _, executor = make_executor(proxy_driver, checkpoint_root, SLEEP60_MTI)
+    class ProvidedThermosRunnerMatcher(object):
+      """Matcher that ensures a bound method 'stop' from 
'ProvidedThermosTaskRunner' is called."""
+
+      def __eq__(self, other):
+        return (type(other.im_self).__name__ == 'ProvidedThermosTaskRunner'
+            and other.__name__ == 'stop')
+
+    with contextlib.nested(
+        temporary_dir(),
+        mock.patch('apache.aurora.executor.aurora_executor.propagate_deadline',
+            wraps=propagate_deadline)) as (checkpoint_root, 
mock_propagate_deadline):
+
+      _, executor = make_executor(
+          proxy_driver,
+          checkpoint_root,
+          SLEEP60_MTI,
+          stop_timeout_in_secs=123)
       # send two, expect at most one delivered
       executor.killTask(proxy_driver, mesos_pb2.TaskID(value='sleep60-001'))
       executor.killTask(proxy_driver, mesos_pb2.TaskID(value='sleep60-001'))
       executor.terminated.wait()
 
-    updates = proxy_driver.method_calls['sendStatusUpdate']
-    assert len(updates) == 3
-    assert updates[-1][0][0].state == mesos_pb2.TASK_KILLED
+      updates = proxy_driver.method_calls['sendStatusUpdate']
+
+      mock_propagate_deadline.assert_called_with(  # Ensure 'stop' is called 
with custom timeout.
+          ProvidedThermosRunnerMatcher(),
+          timeout=Amount(123, Time.SECONDS))
+      assert len(updates) == 3
+      assert updates[-1][0][0].state == mesos_pb2.TASK_KILLED
 
   def test_shutdown(self):
     proxy_driver = ProxyDriver()
@@ -618,12 +639,11 @@ class TestThermosExecutor(object):
     with temporary_dir() as tempdir:
       te = FastThermosExecutor(
         runner_provider=make_provider(tempdir, 
mesos_containerizer_path='/doesnotexist'),
-        sandbox_provider=FileSystemImageTestSandboxProvider())
+        sandbox_provider=FileSystemImageTestSandboxProvider(), 
stop_timeout_in_secs=1)
       te.launchTask(proxy_driver, make_task(HELLO_WORLD_MTI))
 
       te.SANDBOX_INITIALIZATION_TIMEOUT = Amount(1, Time.MILLISECONDS)
       te.START_TIMEOUT = Amount(10, Time.MILLISECONDS)
-      te.STOP_TIMEOUT = Amount(10, Time.MILLISECONDS)
 
       proxy_driver.wait_stopped()
 
@@ -643,11 +663,10 @@ class TestThermosExecutor(object):
 
       te = FastThermosExecutor(
         runner_provider=make_provider(tempdir, 
mesos_containerizer_path=tempfile),
-        sandbox_provider=FileSystemImageTestSandboxProvider())
+        sandbox_provider=FileSystemImageTestSandboxProvider(), 
stop_timeout_in_secs=1)
 
       te.SANDBOX_INITIALIZATION_TIMEOUT = Amount(1, Time.MILLISECONDS)
       te.START_TIMEOUT = Amount(10, Time.MILLISECONDS)
-      te.STOP_TIMEOUT = Amount(10, Time.MILLISECONDS)
 
       te.launchTask(proxy_driver, make_task(HELLO_WORLD_MTI))
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/executor/test_thermos_task_runner.py 
b/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
index 1b92667..7096557 100644
--- a/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
+++ b/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
@@ -32,6 +32,7 @@ from twitter.common.quantity import Amount, Time
 
 from apache.aurora.config.schema.base import MB, MesosTaskInstance, Process, 
Resources, Task
 from apache.aurora.executor.common.sandbox import DirectorySandbox
+from apache.aurora.executor.common.status_checker import StatusResult
 from apache.aurora.executor.http_lifecycle import HttpLifecycleManager
 from apache.aurora.executor.thermos_task_runner import ThermosTaskRunner
 from apache.thermos.common.statuses import (
@@ -227,33 +228,95 @@ class TestThermosTaskRunnerIntegration(object):
       assert task_runner.status.status == mesos_pb2.TASK_KILLED
 
   @patch('apache.aurora.executor.http_lifecycle.HttpSignaler')
-  def test_integration_http_teardown(self, SignalerClass):
+  def test_integration_http_teardown_killed(self, SignalerClass):
+    """Ensure that the http teardown procedure closes correctly when abort 
kills the process."""
     signaler = SignalerClass.return_value
-    signaler.side_effect = lambda path, use_post_method: (path != 
'/quitquitquit', None)
+    signaler.side_effect = lambda path, use_post_method: (path == 
'/abortabortabort', None)
 
     clock = Mock(wraps=time)
 
+    class TerminalStateStatusRunner(ThermosTaskRunner):
+      """
+      Status is called each poll in the teardown procedure. We return kill 
after the 3rd poll
+      to mimic a task that exits early. We want to ensure the shutdown 
procedure doesn't wait
+      the full time if it doesn't need to.
+      """
+
+      TIMES_CALLED = 0
+
+      @property
+      def status(self):
+        if (self.TIMES_CALLED >= 3):
+          return StatusResult('Test task mock status', mesos_pb2.TASK_KILLED)
+        self.TIMES_CALLED += 1
+
     with self.yield_sleepy(
-        ThermosTaskRunner,
+        TerminalStateStatusRunner,
         portmap={'health': 3141},
         clock=clock,
-        sleep=1000,
+        sleep=0,
         exit_code=0) as task_runner:
 
-      class ImmediateHttpLifecycleManager(HttpLifecycleManager):
-        ESCALATION_WAIT = Amount(1, Time.MICROSECONDS)
+      graceful_shutdown_wait = Amount(1, Time.SECONDS)
+      shutdown_wait = Amount(5, Time.SECONDS)
+      http_task_runner = HttpLifecycleManager(
+          task_runner, 3141, [('/quitquitquit', graceful_shutdown_wait),
+          ('/abortabortabort', shutdown_wait)], clock=clock)
+      http_task_runner.start()
+      task_runner.forked.wait()
+      http_task_runner.stop()
+
+      http_teardown_poll_wait_call = 
call(HttpLifecycleManager.WAIT_POLL_INTERVAL.as_(Time.SECONDS))
+      assert clock.sleep.mock_calls.count(http_teardown_poll_wait_call) == 3  
# Killed before 5
+      assert signaler.mock_calls == [
+        call('/quitquitquit', use_post_method=True),
+        call('/abortabortabort', use_post_method=True)]
+
+  @patch('apache.aurora.executor.http_lifecycle.HttpSignaler')
+  def test_integration_http_teardown_escalate(self, SignalerClass):
+    """Ensure that the http teardown process fully escalates when quit/abort 
both fail to kill."""
+    signaler = SignalerClass.return_value
+    signaler.side_effect = lambda path, use_post_method: (True, None)
+
+    clock = Mock(wraps=time)
+
+    class KillCalledTaskRunner(ThermosTaskRunner):
+      def __init__(self, *args, **kwargs):
+        self._killed_called = False
+        ThermosTaskRunner.__init__(self, *args, **kwargs)
+
+      def kill_called(self):
+        return self._killed_called
+
+      def kill(self):
+        self._killed_called = True
+
+      @property
+      def status(self):
+        return None
+
+    with self.yield_sleepy(
+        KillCalledTaskRunner,
+        portmap={'health': 3141},
+        clock=clock,
+        sleep=0,
+        exit_code=0) as task_runner:
 
-      http_task_runner = ImmediateHttpLifecycleManager(
-          task_runner, 3141, ['/quitquitquit', '/abortabortabort'], 
clock=clock)
+      graceful_shutdown_wait = Amount(1, Time.SECONDS)
+      shutdown_wait = Amount(5, Time.SECONDS)
+      http_task_runner = HttpLifecycleManager(
+          task_runner, 3141, [('/quitquitquit', graceful_shutdown_wait),
+          ('/abortabortabort', shutdown_wait)], clock=clock)
       http_task_runner.start()
       task_runner.forked.wait()
       http_task_runner.stop()
 
-      escalation_wait = 
call(ImmediateHttpLifecycleManager.ESCALATION_WAIT.as_(Time.SECONDS))
-      assert clock.sleep.mock_calls.count(escalation_wait) == 1
+      http_teardown_poll_wait_call = 
call(HttpLifecycleManager.WAIT_POLL_INTERVAL.as_(Time.SECONDS))
+      assert clock.sleep.mock_calls.count(http_teardown_poll_wait_call) == 6
       assert signaler.mock_calls == [
         call('/quitquitquit', use_post_method=True),
         call('/abortabortabort', use_post_method=True)]
+      assert task_runner.kill_called() == True
 
   def test_thermos_normal_exit_status(self):
     with self.exit_with_status(0, TaskState.SUCCESS) as task_runner:

aurora git commit: Adding Configurable Wait Period for Graceful Shutdowns

Reply via email to