Re: Review Request 65695: Made the default executor allow schedulers to retry task kills.

Alexander Rukletsov Tue, 20 Mar 2018 05:54:07 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review199540
-----------------------------------------------------------

A high level question I have is the semantics of a task kill. You retry 
`SIGKILL` but not `SIGTERM`. Why is it so? Do you expect frameworks to retry 
kills at some point of time (which should be less that the kill policy timeout 
to make sense) if they do not receive a terminal status update? Since the 
escalation timer is started anyway, is it correct to transition the container 
back to `killing=false`? This can also lead to multiple `SIGKILL` requests 
going to the agent for the same container, which is likely not a big deal, but 
maybe is worth checking?

My suggestion would be to retry everything ourselves and do not expect 
frameworks to do this for us. When we faced a similar problem with the docker 
executor, we decided that once the task has been transitioned to `killed`, 
there is no way back, even if the kill attempt failed.

src/launcher/default_executor.cpp
Lines 1242-1243 (original)
<https://reviews.apache.org/r/65695/#comment279827>

    Is this change related to your chain? Or does this fix a different issue 
with the broken invariant?

src/launcher/default_executor.cpp
Lines 1278 (patched)
<https://reviews.apache.org/r/65695/#comment279826>

    This now means that the executor sends `TASK_KILLING` status update and 
then might unset the `killing` flag. Can you please add a comment for 
`DefaultExecutor::Container::killing` saying this?

- Alexander Rukletsov

On March 19, 2018, 4:36 p.m., Gaston Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65695/
> -----------------------------------------------------------
> 
> (Updated March 19, 2018, 4:36 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Joseph Wu, Qian Zhang, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8530
>     https://issues.apache.org/jira/browse/MESOS-8530
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The default executor transitions a task to `TASK_KILLING` and marks its
> child container as being killed before posting a `KILL` call to the
> agent.
> 
> The executor ignores kill requests for containers that are marked as
> being killed, and it doesn't remove this mark if the `KILL` call fails.
> This means that it's possible for tasks to get stuck in a `TASK_KILLING`
> state.
> 
> This patch makes the default executor remove the killing mark if a
> `KILL` call fails. That way a scheduler can retry a kill.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 
> 
> 
> Diff: https://reviews.apache.org/r/65695/diff/3/
> 
> 
> Testing
> -------
> 
> `sudo bin/mesos-tests.sh` on GNU/Linux
> 
> 
> Thanks,
> 
> Gaston Kleiman
> 
>

Re: Review Request 65695: Made the default executor allow schedulers to retry task kills.

Reply via email to