----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/44657/ -----------------------------------------------------------
(Updated March 23, 2016, 4:56 p.m.) Review request for mesos, Ben Mahler and Gilbert Song. Changes ------- Updated testing with manual tests. Bugs: MESOS-4909 https://issues.apache.org/jira/browse/MESOS-4909 Repository: mesos Description (updated) ------- The command executor determines how much time it allots the underlying task to clean up (effectively how long to wait for the task to comply to SIGTERM before sending SIGKILL) based on both optional task's `KillPolicy` and optional `shutdown_grace_period` field in `ExecutorInfo`. Manual testing was performed to ensure newly introduced protobuf fields are respected. To do that, "mesos-execute" was modified to support `KillPolicy` and `CommandInfo.shell=false`. To simulate a task that does not exit in the allotted period, a tiny app (https://github.com/rukletsov/unresponsive-process) that ignores SIGTERM was used. Diffs ----- src/launcher/executor.cpp 2df62f09637b55c63ae6fe5d5a70b8debc02fbe2 src/slave/slave.cpp 840534ff0687e82ed063c386e36bbabada230697 Diff: https://reviews.apache.org/r/44657/diff/ Testing (updated) ------- To test the newly introduced protobuf fields, I've modified "mesos-execute" to support these fields. I've inserted these lines in `resourceOffers()` method: ``` task.mutable_kill_policy()->mutable_grace_period()-> set_nanoseconds(Seconds(<value>).ns()); ``` I've also implemented a "--no-shell" flag to support setting `CommandInfo.shell` to `false`. An example command used in testing: ``` ./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --no-shell ``` In order to test the signal escalation path, which is what the command executor implements currently for kill policy, I've created a tiny application that ignores SIGTERM (https://github.com/rukletsov/unresponsive-process ). Test 1.1: ========= `KillPolicy.grace_period` is **not set**, `shell=false`, framework shutdown. Framework is asked to shutdown several seconds after the task is started. This results in agent asking executor to shutdown. From the agent's logs, we see that the command executor sends a TASK_KILLED update in approx. **3 seconds** (the default value is **5 seconds**, but we deduct some buffer to make sure the task can be reaped and TASK_KILLED is sent) after it has been asked to shut down: ``` I0323 17:24:48.704677 418881536 slave.cpp:2080] Asked to shut down framework 9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000 by master@127.0.0.1:5050 I0323 17:24:48.704710 418881536 slave.cpp:2105] Shutting down framework 9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000 I0323 17:24:48.704756 418881536 slave.cpp:4216] Shutting down executor 'test' of framework 9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000 at executor(1)@192.168.178.24:50933 I0323 17:24:51.817987 419418112 slave.cpp:3003] Handling status update TASK_KILLED (UUID: 9cf799dd-3492-42c2-9b43-fcb7283ebec9) for task test of framework 9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000 from executor(1)@192.168.178.24:50933 ``` Excerpt from the executor's log we see, that `CommandExecutor::escalated()` has been called in 3 seconds: ``` Shutting down Sending SIGTERM to process tree at pid 23632 9276662428399220356 Sent SIGTERM to the following process trees: [ --- 23632 /Users/alex/bin/unresponsive_process ] 10137513391590427691 12741826412569908689 Process 23632 did not terminate after 3secs, sending SIGKILL to process tree at 23632 16972857197258033356 Killed the following process trees: [ --- 23632 /Users/alex/bin/unresponsive_process ] Command terminated with signal Killed: 9 (pid: 23632) ``` Test 1.2: ========= `KillPolicy.grace_period` is **not set**, `shell=true`, framework shutdown. The only difference to test 1.1 is `shell=true`. Similar behavior is observed. Test 2.1: ========= `KillPolicy.grace_period` set to **0s**, `shell=false`, framework shutdown. Similar to test 1.1, but the escalation timeout is set to 0s (no deductions because the kill policy is set explicitly; the grace period in containerizer should be adjusted to approx. 2 seconds): ``` I0323 17:38:07.650616 189734912 slave.cpp:2080] Asked to shut down framework 0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000 by master@127.0.0.1:5050 I0323 17:38:07.650647 189734912 slave.cpp:2105] Shutting down framework 0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000 I0323 17:38:07.650686 189734912 slave.cpp:4216] Shutting down executor 'test' of framework 0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000 at executor(1)@192.168.178.24:51836 I0323 17:38:07.823477 187052032 slave.cpp:3003] Handling status update TASK_KILLED (UUID: 56aad190-2ea2-4bb2-8d61-7a49ff42d4ba) for task test of framework 0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000 from executor(1)@192.168.178.24:51836 ``` Excerpt from the executor's log we see, that `CommandExecutor::escalated()` has been called in 0 seconds: ``` Shutting down Sending SIGTERM to process tree at pid 24075 Sent SIGTERM to the following process trees: [ 10370891801885978953 --- 24075 /Users/alex/bin/unresponsive_process ] Process 24075 did not terminate after 0ns, sending SIGKILL to process tree at 24075 Killed the following process trees: [ --- 24075 /Users/alex/bin/unresponsive_process ] Command terminated with signal Killed: 9 (pid: 24075) ``` Test 2.2: ========= `KillPolicy.grace_period` set to **0s**, `shell=true`, framework shutdown The only difference to test 2.1 is `shell=true`. Similar behavior is observed. Test 3.1: ========= `KillPolicy.grace_period` set to **11s**, `shell=false`, framework shutdown Similar to test 1.1, but the escalation timeout is set to 11 seconds (no deductions because the kill policy is set explicitly): ``` I0323 16:59:19.218186 189407232 slave.cpp:2105] Shutting down framework 8bd2df7e-5659-4887-800e-604db3694e17-0000 I0323 16:59:19.218214 189407232 slave.cpp:4216] Shutting down executor 'test' of framework 8bd2df7e-5659-4887-800e-604db3694e17-0000 at executor(1)@192.168.178.24:65526 I0323 16:59:30.353431 190480384 slave.cpp:3003] Handling status update TASK_KILLED (UUID: e7a5fcfb-33a2-4341-8ca4-d372c754020c) for task test of framework 8bd2df7e-5659-4887-800e-604db3694e17-0000 from executor(1)@192.168.178.24:65526 ``` Excerpt from the executor's log we see, that `CommandExecutor::escalated()` has been called in 11 seconds: ``` Shutting down Sending SIGTERM to process tree at pid 23173 Sent SIGTERM to the following process trees: 5447330384702360843 [ --- 23173 /Users/alex/bin/unresponsive_process ] 9421153335942970033 12365643494559189178 7584901102993551143 12120587866343901205 9383577596543058034 10156641407552912845 12050873774172063028 11004056749756330400 13931472820830639858 16620123046684546445 Process 23173 did not terminate after 11secs, sending SIGKILL to process tree at 23173 Killed the following process trees: [ --- 23173 /Users/alex/bin/unresponsive_process ] Command terminated with signal Killed: 9 (pid: 23173) ``` Test 3.2: ========= `KillPolicy.grace_period` set to **11s**, `shell=true`, framework shutdown The only difference to test 3.1 is `shell=true`. Similar behavior is observed. Thanks, Alexander Rukletsov