-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/#review131890
-----------------------------------------------------------




src/launcher/http_command_executor.cpp (lines 569 - 575)
<https://reviews.apache.org/r/46325/#comment195958>

    It looks like we have to re-assign `kill_policy`! Otherwise, if kill is called internally (e.g. on a health check failure) after a Kill message carrying a `kill_policy` has arrived, we may accidentally revert to the old grace period!
    
    Something like:
    
    ```
    if (override.isSome()) {
      killPolicy = override.get();
    }
    ```
    
    If we don't re-assign, we will take the old one from `TaskInfo` when kill is called by the executor itself.
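    
    To make that concrete, here is a hedged sketch of the surrounding handler, assuming the executor keeps the effective policy in an `Option<KillPolicy> killPolicy` member initialized from `TaskInfo` (the `kill()` signature here is illustrative, not the actual diff):
    
    ```
    void kill(const TaskID& _taskId, const Option<KillPolicy>& override = None())
    {
      // Persist the override so that a later internal kill (e.g. one
      // triggered by a health check failure) keeps using the overridden
      // grace period instead of falling back to the one in `TaskInfo`.
      if (override.isSome()) {
        killPolicy = override.get();
      }
    
      // ... proceed with SIGTERM and escalation scheduling as before ...
    }
    ```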



src/launcher/http_command_executor.cpp (line 615)
<https://reviews.apache.org/r/46325/#comment195950>

    How about:
    
    ```
    // If the task is being killed but hasn't terminated and
    // we receive another kill request, we see if we need
    // to adjust the remaining grace period.
    ```



src/launcher/http_command_executor.cpp (line 617)
<https://reviews.apache.org/r/46325/#comment195957>

    grace period



src/launcher/http_command_executor.cpp (lines 622 - 625)
<https://reviews.apache.org/r/46325/#comment195956>

    The second bullet isn't really a limitation here, right?
    
    Since we track when the grace period started, we're now just adjusting the end of the grace period to reflect the newly desired duration, so we can easily support both shrinking and growing.



src/launcher/http_command_executor.cpp (lines 627 - 633)
<https://reviews.apache.org/r/46325/#comment195951>

    This is helpful, thanks! Some adjustment suggestions:
    
    ```
          // Here are some examples to illustrate:
          //
          // 20, 30 -> Increased grace period is a no-op, grace period remains 20.
          // 20, 20 -> Retries are a no-op, grace period remains 20.
          // 20, 5  -> if elapsed >= 5:
          //             SIGKILL immediately (total grace period is elapsed).
          //           if elapsed < 5:
          //             SIGKILL in 5 - elapsed (total grace period is 5).
    ```



src/launcher/http_command_executor.cpp (lines 635 - 637)
<https://reviews.apache.org/r/46325/#comment195955>

    How about re-ordering this?
    
    ```
    if (killingInitiated + gracePeriod > escalationTimer.timeout().time()) {
      return;
    }
    ```
    
    Since this is an "increase", it seems nice to read that the new grace period end is "larger" than the existing one.



src/launcher/http_command_executor.cpp (lines 639 - 642)
<https://reviews.apache.org/r/46325/#comment195949>

    "elasped" and "remaining" are more complimentary here, 
"overriddenGracePeriod" suggests the grace period but it's really just the 
remaining time on the timer:
    
    ```
          Duration elapsed = Clock::now() - killGracePeriodStart;
          Duration remaining = gracePeriod - elapsed;
    ```
    
    Why did you have the `Duration::zero` guard? You shouldn't need it.



src/launcher/http_command_executor.cpp (lines 644 - 645)
<https://reviews.apache.org/r/46325/#comment195948>

    How about:
    
    ```
    cout << "Received a new kill policy grace period of " << gracePeriod
         << "; shortening remaining grace period time to " << remaining;
    ```
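    
    Putting these pieces together, a hedged sketch of the full rescheduling path (libprocess's `Clock` and `delay` assumed; the member names follow the suggestions in these comments, and `escalated()` stands in for the existing escalation handler):
    
    ```
    // Grace period increases and exact retries are no-ops.
    if (killGracePeriodStart.get() + gracePeriod >
        killGracePeriodTimer.get().timeout().time()) {
      return;
    }
    
    Duration elapsed = Clock::now() - killGracePeriodStart.get();
    Duration remaining = gracePeriod - elapsed;
    
    cout << "Received a new kill policy grace period of " << gracePeriod
         << "; shortening remaining grace period time to " << remaining
         << endl;
    
    // `Clock::cancel()` removes the old timer before returning, so at
    // most one escalation timer is ever outstanding.
    Clock::cancel(killGracePeriodTimer.get());
    
    // If the new grace period has already elapsed, `remaining` is
    // negative and the delayed call should fire immediately, which is
    // why no `Duration::zero()` guard should be needed.
    killGracePeriodTimer =
      delay(remaining, self(), &Self::escalated, gracePeriod);
    ```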



src/launcher/http_command_executor.cpp (lines 687 - 690)
<https://reviews.apache.org/r/46325/#comment195947>

    It would be nice to have `killingInitiated + gracePeriod == escalationTimer.timeout().time()`, but the two sides are slightly off right now:
    
    killingInitiated = 0.9
    escalationTimer = 1.0 + 5 (gracePeriod)
    
    This means your code above will tend to print out "re-scheduling" on the 
first retry.
    
    One suggestion is to store killingInitiated after you start the timer to 
avoid this. Ideally, we could do this:
    
    ```
    killGracePeriodTimer = delay(...);
    killGracePeriodStart = killGracePeriodTimer.start_time();
    ```
    
    But we can't get the start time out of a timer currently.
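    
    In the meantime, a sketch of the suggested ordering (the `delay(...)` arguments are elided; names as in the comment below):
    
    ```
    // Start the escalation timer first and record the start time after,
    // so that `killGracePeriodStart + gracePeriod` lands at or after the
    // timer's actual timeout and an identical retry takes the no-op path
    // above instead of printing "re-scheduling".
    killGracePeriodTimer = delay(...);
    killGracePeriodStart = Clock::now();
    ```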



src/launcher/http_command_executor.cpp (lines 881 - 884)
<https://reviews.apache.org/r/46325/#comment195929>

    Per our offline chat, readers here likely won't be wondering why there are two variables for these (that requires a lot of context about the road you went down trying to consolidate them into just the timer variable). More likely, they're interested in what these represent.
    
    If we change the naming, I think we can go without needing a comment here:
    
    ```
    Option<Time> killGracePeriodStart;
    Option<Timer> killGracePeriodTimer;
    ```


- Ben Mahler


On May 5, 2016, 3:39 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46325/
> -----------------------------------------------------------
> 
> (Updated May 5, 2016, 3:39 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Qian Zhang.
> 
> 
> Bugs: MESOS-4908
>     https://issues.apache.org/jira/browse/MESOS-4908
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Kill policy can be provided in a kill event. In this case it should
> take precedence over kill policy specified when the task was launched.
> When kill event is issued multiple times during the task termination,
> the signal escalation timeout (the time a task has between SIGTERM
> and SIGKILL) may be reduced.
> 
> Since updating the delay timer (we use it for signal escalation delay)
> is currently not possible, we cancel the existing signal timer and set
> up a new one. `Clock::cancel()` guarantees that the timer, if it existed,
> is removed before the function returns; hence we never set up more than
> one timer for the signal escalation delay.
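> 
> To illustrate the pattern (a minimal sketch; the names are illustrative):
> 
> ```
> // `Clock::cancel()` removes the existing timer, if any, before it
> // returns, so the replacement below is the only outstanding timer.
> Clock::cancel(escalationTimer);
> 
> // Set up the single replacement timer with the reduced timeout.
> escalationTimer = delay(remaining, self(), &Self::escalated, gracePeriod);
> ```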
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 
> d2f15b0447d91f3a4cd92f07114cb366647cc7d3 
> 
> Diff: https://reviews.apache.org/r/46325/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Tested manually using a modified `mesos-execute` so that **two** extra
> kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each
> kill task request specifies a `KillPolicy` with a 1s grace period. Together
> with a kill *without* a kill policy scheduled 1s after the task is launched,
> the task receives **three** kill requests in total.
> 
> Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
> Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 
> --http_command_executor`
> Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test 
> --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' 
> --kill_after=1secs`
> 
> HTTP command executor log:
> ```
> Received SUBSCRIBED event
> Subscribed executor on alexr.fritz.box
> Received LAUNCH event
> Starting task test
> sh -c '/Users/alex/bin/unresponsive_process'
> Forked command at 17475
> 14455919081943275466
> Received ACKNOWLEDGED event
> 17172602460659762152
> Received KILL event
> Received kill for task test
> Sending SIGTERM to process tree at pid 17475
> Sent SIGTERM to the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> 4381544758593790168
> Scheduling escalation to SIGKILL in 3secs from now
> Received ACKNOWLEDGED event
> Received KILL event
> Received kill for task test
> Rescheduling escalation to SIGKILL in 1secs from now
> 10370891801885978953
> Process 17475 did not terminate after 1secs, sending SIGKILL to process tree 
> at 17475
> Killed the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> Received KILL event
> Received kill for task test
> Command terminated with signal Killed: 9 (pid: 17475)
> ```
> 
> Excerpt from the agent log that shows all 3 kill task requests and that the 
> signal escalation timeout was reduced from 3s to 1s:
> ```
> I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update 
> TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received 
> status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) 
> for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for 
> /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update 
> TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received 
> status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for 
> task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update 
> TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received 
> status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) 
> for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for 
> /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update 
> TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received 
> status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for 
> task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update 
> TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of 
> framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> 
> ```
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>
