> On May 5, 2016, 8:25 p.m., Ben Mahler wrote:
> > src/launcher/http_command_executor.cpp, lines 644-645
> > <https://reviews.apache.org/r/46325/diff/5/?file=1373555#file1373555line644>
> >
> > How about:
> >
> > ```
> > cout << "Received a new kill policy grace period of " << gracePeriod
> >      << "; shortening remaining grace period time to " << remaining;
> > ```
> 
> Alexander Rukletsov wrote:
>     This is:
>     * Not symmetrical to the log message when we issue a kill for the
>       first time;
>     * Not correct, because the total grace period is `remaining + elapsed`.
> 
>     Why do you think this level of detail is important in the log? We
>     already have a log entry for each kill.
As an alternative, let's print it when `kill()` is called.


> On May 5, 2016, 8:25 p.m., Ben Mahler wrote:
> > src/launcher/http_command_executor.cpp, lines 569-575
> > <https://reviews.apache.org/r/46325/diff/5/?file=1373555#file1373555line569>
> >
> > It looks like we have to re-assign kill_policy! Otherwise, if kill is
> > called internally (e.g. health check failure) after a Kill message with
> > a kill_policy arrives, we may accidentally change the grace period!
> >
> > Something like:
> >
> > ```
> > if (override.isSome()) {
> >   killPolicy = override.get();
> > }
> > ```
> >
> > If we don't re-assign, we will take the old one from TaskInfo when kill
> > is called by the executor itself.
> 
> Alexander Rukletsov wrote:
>     This was intentional. If the health check fails during the grace
>     shutdown, but the task has not been reaped yet, doesn't it make sense
>     to kill it faster?

Add a TODO in health checks about kill policy.


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/#review131890
-----------------------------------------------------------


On May 7, 2016, 1:03 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46325/
> -----------------------------------------------------------
> 
> (Updated May 7, 2016, 1:03 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Qian Zhang.
> 
> 
> Bugs: MESOS-4908
>     https://issues.apache.org/jira/browse/MESOS-4908
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> A kill policy can be provided in a kill event. In this case it should
> take precedence over the kill policy specified when the task was
> launched. When a kill event is issued multiple times during task
> termination, the signal escalation timeout (the time a task has between
> SIGTERM and SIGKILL) may be reduced.
> Since updating the delay timer (which we use for the signal escalation
> delay) is currently not possible, we cancel the existing signal timer
> and set up a new one. `Clock::cancel()` guarantees that the timer, if it
> exists, is removed before the function returns; hence we never set up
> more than one timer for the signal escalation delay.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp d2f15b0447d91f3a4cd92f07114cb366647cc7d3 
> 
> Diff: https://reviews.apache.org/r/46325/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Tested manually using a modified `mesos-execute` such that **two** extra
> kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`.
> Each kill task request specifies a `KillPolicy` with a 1s grace period.
> Together with a kill *without* a kill policy scheduled 1s after the task
> is launched, the task receives **three** kill requests in total.
> 
> Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
> Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
> Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`
> 
> HTTP command executor log:
> ```
> Received SUBSCRIBED event
> Subscribed executor on alexr.fritz.box
> Received LAUNCH event
> Starting task test
> sh -c '/Users/alex/bin/unresponsive_process'
> Forked command at 17475
> 14455919081943275466
> Received ACKNOWLEDGED event
> 17172602460659762152
> Received KILL event
> Received kill for task test
> Sending SIGTERM to process tree at pid 17475
> Sent SIGTERM to the following process trees:
> [
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> 4381544758593790168
> Scheduling escalation to SIGKILL in 3secs from now
> Received ACKNOWLEDGED event
> Received KILL event
> Received kill for task test
> Rescheduling escalation to SIGKILL in 1secs from now
> 10370891801885978953
> Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
> Killed the following process trees:
> [
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> Received KILL event
> Received kill for task test
> Command terminated with signal Killed: 9 (pid: 17475)
> ```
> 
> Excerpt from the agent log that shows all three kill task requests and
> that the signal escalation timeout was reduced from 3s to 1s:
> ```
> I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> ```
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>