----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65695/#review199540 -----------------------------------------------------------
A high level question I have is the semantics of a task kill. You retry `SIGKILL` but not `SIGTERM`. Why is it so? Do you expect frameworks to retry kills at some point of time (which should be less that the kill policy timeout to make sense) if they do not receive a terminal status update? Since the escalation timer is started anyway, is it correct to transition the container back to `killing=false`? This can also lead to multiple `SIGKILL` requests going to the agent for the same container, which is likely not a big deal, but maybe is worth checking? My suggestion would be to retry everything ourselves and do not expect frameworks to do this for us. When we faced a similar problem with the docker executor, we decided that once the task has been transitioned to `killed`, there is no way back, even if the kill attempt failed. src/launcher/default_executor.cpp Lines 1242-1243 (original) <https://reviews.apache.org/r/65695/#comment279827> Is this change related to your chain? Or does this fix a different issue with the broken invariant? src/launcher/default_executor.cpp Lines 1278 (patched) <https://reviews.apache.org/r/65695/#comment279826> This now means that the executor sends `TASK_KILLING` status update and then might unset the `killing` flag. Can you please add a comment for `DefaultExecutor::Container::killing` saying this? - Alexander Rukletsov On March 19, 2018, 4:36 p.m., Gaston Kleiman wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65695/ > ----------------------------------------------------------- > > (Updated March 19, 2018, 4:36 p.m.) > > > Review request for mesos, Alexander Rukletsov, Joseph Wu, Qian Zhang, and > Vinod Kone. > > > Bugs: MESOS-8530 > https://issues.apache.org/jira/browse/MESOS-8530 > > > Repository: mesos > > > Description > ------- > > The default executor transitions a task to `TASK_KILLING` and marks its > child container as being killed before posting a `KILL` call to the > agent. > > The executor ignores kill requests for containers that are marked as > being killed, and it doesn't remove this mark if the `KILL` call fails. > This means that it's possible for tasks to get stuck in a `TASK_KILLING` > state. > > This patch makes the default executor remove the killing mark if a > `KILL` call fails. That way a scheduler can retry a kill. > > > Diffs > ----- > > src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 > > > Diff: https://reviews.apache.org/r/65695/diff/3/ > > > Testing > ------- > > `sudo bin/mesos-tests.sh` on GNU/Linux > > > Thanks, > > Gaston Kleiman > >
