> On Oct. 4, 2017, 1:40 p.m., Gaston Kleiman wrote: > > src/tests/default_executor_tests.cpp > > Lines 1273-1277 (patched) > > <https://reviews.apache.org/r/62775/diff/1/?file=1846112#file1846112line1273> > > > > You could use the `KillPolicyTestHelper` with `--sleep_duration=0` > > instead of this. > > > > If you do that, you don't need to mark the test as `ROOT_INTERNET_CURL` > > and it will be run more often. > > Qian Zhang wrote: > Thanks, I have tried it, and found an issue about this test: When I use > the `nginx:alpine` image and ran this test, I found: > ``` > [ RUN ] > MesosContainerizer/DefaultExecutorTest.ROOT_INTERNET_CURL_NoTransitionFromKillingToFinished/0 > ... > I1006 06:32:32.615941 29160 default_executor.cpp:1004] Killing task > 6eacb18a-69f6-48ca-8ffd-4134b75c8525 running in child container > 10f2af6b-232d-485e-86a0-9bbdd419a280.f6c078ba-9c08-4cb0-80fc-18bc154477be > with SIGTERM signal > I1006 06:32:32.616163 29160 default_executor.cpp:1026] Scheduling > escalation to SIGKILL in 3secs from now > I1006 06:32:32.664510 29177 default_executor.cpp:185] Received > ACKNOWLEDGED event > I1006 06:32:32.700647 29180 default_executor.cpp:842] Child container > 10f2af6b-232d-485e-86a0-9bbdd419a280.f6c078ba-9c08-4cb0-80fc-18bc154477be of > task '6eacb18a-69f6-48ca-8ffd-4134b75c8525' in state TASK_KILLED terminated > with signal Terminated > ... > ``` > > And when I use `KillPolicyTestHelper` with `--sleep_duration=0` and run > this test, I found: > ``` > [ RUN ] > MesosContainerizer/DefaultExecutorTest.ROOT_NoTransitionFromKillingToFinished/0 > I1006 06:38:32.138226 30407 default_executor.cpp:1004] Killing task > 636f6b03-b083-4522-813e-aa2122e60915 running in child container > c9af5166-6aee-4cb3-890e-2846fcb1179e.0994b090-5e88-4a36-a63a-3eacbbbabf89 > with SIGTERM signal > I1006 06:38:32.138247 30407 default_executor.cpp:1026] Scheduling > escalation to SIGKILL in 3secs from now > I1006 06:38:32.187152 30421 default_executor.cpp:185] Received > ACKNOWLEDGED event > I1006 06:38:32.225972 30423 default_executor.cpp:842] Child container > c9af5166-6aee-4cb3-890e-2846fcb1179e.0994b090-5e88-4a36-a63a-3eacbbbabf89 of > task '636f6b03-b083-4522-813e-aa2122e60915' in state TASK_KILLED terminated > with signal Killed > ``` > > So as you see, the child container was terminated with either `signal > Terminated` (SIGTERM) or `signal Killed` (SIGKILL), this seems not correct, > because when I run the same task group (`nginx:alpine` or > `KillPolicyTestHelper`) with `mesos-execute`, I see: > ``` > I1006 01:35:42.375380 12666 default_executor.cpp:1004] Killing task test1 > running in child container > 814d48e9-4776-420a-8194-e61c820315be.921436e3-238e-4b86-a644-189c40d44da8 > with SIGTERM signal > I1006 01:35:42.375432 12666 default_executor.cpp:1026] Scheduling > escalation to SIGKILL in 3secs from now > I1006 01:35:42.530655 12635 default_executor.cpp:842] Child container > 814d48e9-4776-420a-8194-e61c820315be.921436e3-238e-4b86-a644-189c40d44da8 of > task 'test1' in state TASK_KILLED exited with status 0 > ``` > This time the child container exited with status 0 rather than terminated > with any signal, I think this is correct behavior. So it seems as long as I > run the test, the task (child container) never exits with 0, this is not what > we expect. I am not sure what's wrong with test, any comments? > > Gaston Kleiman wrote: > Have you tried setting `CommandInfo.shell` to `false` and adding > `--sleep_duration=0` to `CommandInfo.arguments`? If `CommandInfo.shell` is > set to `true`, `sh` will receive the SIGTERM signal and exit instead of > propagating it to the helper. > > Qian Zhang wrote: > Yeah, I have already set it, here is the diff on top of this patch: > ``` > - v1::CommandInfo command; > - command.set_shell(false); > + v1::CommandInfo commandInfo; > + commandInfo.set_shell(false); > + commandInfo.set_value(getTestHelperPath("test-helper")); > + commandInfo.add_arguments("test-helper"); > + commandInfo.add_arguments(KillPolicyTestHelper::NAME); > + commandInfo.add_arguments("--sleep_duration=1"); > > v1::TaskInfo taskInfo = v1::createTask( > agentId, > v1::Resources::parse("cpus:0.1;mem:32;disk:32").get(), > - command); > - > - // The "nginx:alpine" container returns an "EXIT_STATUS" of 0 on > - // receiving a SIGTERM. > - mesos::v1::Image image; > - image.set_type(mesos::v1::Image::DOCKER); > - image.mutable_docker()->set_name("nginx:alpine"); > - > - mesos::v1::ContainerInfo* container = taskInfo.mutable_container(); > - container->set_type(mesos::v1::ContainerInfo::MESOS); > - container->mutable_mesos()->mutable_image()->CopyFrom(image); > + commandInfo); > ``` > > After debugging, I found the nested container status file does not exist: > ``` > ls: cannot access > /tmp/MesosContainerizer_DefaultExecutorTest_ROOT_NoTransitionFromKillingToFinished_0_kztkSB/containers/c649ec7d-972c-4d4c-8c3a-14d6fdc44bce/containers/3cafd0f9-f8c7-49b9-8865-6c74f9483dc0/status: > No such file or directory > ``` > But this file exists when the task is running, I am not sure why it gets > deleted during the test. If this file does not exist, the nested container > will be considered as termiated by SIGKILL (see > https://github.com/apache/mesos/blob/1.4.0/src/slave/containerizer/mesos/containerizer.cpp#L2600). > > Qian Zhang wrote: > Found the root cause of this issue, receiving `TASK_RUNNING` does not > guarantee the task (`test-helper`) is fully started, so if we kill the task > right after receiving `TASK_RUNNING`, the task may not be ready yet (e.g., > the signal handler is not set), that's why `SIGTERM` was not handled as what > we expect. We need to make sure the task is fully started before killing it.
Good catch! Anand did something similar in https://reviews.apache.org/r/61575, but it hasn't been committed yet. - Gaston ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/62775/#review187126 ----------------------------------------------------------- On Oct. 16, 2017, 1:36 a.m., Qian Zhang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/62775/ > ----------------------------------------------------------- > > (Updated Oct. 16, 2017, 1:36 a.m.) > > > Review request for mesos, Anand Mazumdar and Vinod Kone. > > > Bugs: MESOS-7975 > https://issues.apache.org/jira/browse/MESOS-7975 > > > Repository: mesos > > > Description > ------- > > Added a test `ROOT_NoTransitionFromKillingToFinished`. > > > Diffs > ----- > > src/tests/default_executor_tests.cpp > 68312010a45df5dbdb6d9d4c49d1faa5d8c60472 > src/tests/kill_policy_test_helper.cpp > a1880595ff015475f1ba49437d49f7397da19422 > > > Diff: https://reviews.apache.org/r/62775/diff/2/ > > > Testing > ------- > > > Thanks, > > Qian Zhang > >
