> On Oct. 4, 2017, 1:40 p.m., Gaston Kleiman wrote:
> > src/tests/default_executor_tests.cpp
> > Lines 1273-1277 (patched)
> > <https://reviews.apache.org/r/62775/diff/1/?file=1846112#file1846112line1273>
> >
> >     You could use the `KillPolicyTestHelper` with `--sleep_duration=0` 
> > instead of this.
> >     
> >     If you do that, you don't need to mark the test as `ROOT_INTERNET_CURL` 
> > and it will be run more often.
> 
> Qian Zhang wrote:
>     Thanks, I have tried it, and found an issue about this test: When I use 
> the `nginx:alpine` image and ran this test, I found:
>     ```
>     [ RUN      ] 
> MesosContainerizer/DefaultExecutorTest.ROOT_INTERNET_CURL_NoTransitionFromKillingToFinished/0
>     ...
>     I1006 06:32:32.615941 29160 default_executor.cpp:1004] Killing task 
> 6eacb18a-69f6-48ca-8ffd-4134b75c8525 running in child container 
> 10f2af6b-232d-485e-86a0-9bbdd419a280.f6c078ba-9c08-4cb0-80fc-18bc154477be 
> with SIGTERM signal
>     I1006 06:32:32.616163 29160 default_executor.cpp:1026] Scheduling 
> escalation to SIGKILL in 3secs from now
>     I1006 06:32:32.664510 29177 default_executor.cpp:185] Received 
> ACKNOWLEDGED event
>     I1006 06:32:32.700647 29180 default_executor.cpp:842] Child container 
> 10f2af6b-232d-485e-86a0-9bbdd419a280.f6c078ba-9c08-4cb0-80fc-18bc154477be of 
> task '6eacb18a-69f6-48ca-8ffd-4134b75c8525' in state TASK_KILLED terminated 
> with signal Terminated
>     ...
>     ```
>     
>     And when I use `KillPolicyTestHelper` with `--sleep_duration=0` and run 
> this test, I found:
>     ```
>     [ RUN      ] 
> MesosContainerizer/DefaultExecutorTest.ROOT_NoTransitionFromKillingToFinished/0
>     I1006 06:38:32.138226 30407 default_executor.cpp:1004] Killing task 
> 636f6b03-b083-4522-813e-aa2122e60915 running in child container 
> c9af5166-6aee-4cb3-890e-2846fcb1179e.0994b090-5e88-4a36-a63a-3eacbbbabf89 
> with SIGTERM signal
>     I1006 06:38:32.138247 30407 default_executor.cpp:1026] Scheduling 
> escalation to SIGKILL in 3secs from now
>     I1006 06:38:32.187152 30421 default_executor.cpp:185] Received 
> ACKNOWLEDGED event
>     I1006 06:38:32.225972 30423 default_executor.cpp:842] Child container 
> c9af5166-6aee-4cb3-890e-2846fcb1179e.0994b090-5e88-4a36-a63a-3eacbbbabf89 of 
> task '636f6b03-b083-4522-813e-aa2122e60915' in state TASK_KILLED terminated 
> with signal Killed
>     ```
>     
>     So as you see, the child container was terminated with either `signal 
> Terminated` (SIGTERM) or `signal Killed` (SIGKILL), this seems not correct, 
> because when I run the same task group (`nginx:alpine` or 
> `KillPolicyTestHelper`) with `mesos-execute`, I see:
>     ```
>     I1006 01:35:42.375380 12666 default_executor.cpp:1004] Killing task test1 
> running in child container 
> 814d48e9-4776-420a-8194-e61c820315be.921436e3-238e-4b86-a644-189c40d44da8 
> with SIGTERM signal
>     I1006 01:35:42.375432 12666 default_executor.cpp:1026] Scheduling 
> escalation to SIGKILL in 3secs from now
>     I1006 01:35:42.530655 12635 default_executor.cpp:842] Child container 
> 814d48e9-4776-420a-8194-e61c820315be.921436e3-238e-4b86-a644-189c40d44da8 of 
> task 'test1' in state TASK_KILLED exited with status 0
>     ```
>     This time the child container exited with status 0 rather than terminated 
> with any signal, I think this is correct behavior. So it seems as long as I 
> run the test, the task (child container) never exits with 0, this is not what 
> we expect. I am not sure what's wrong with test, any comments?
> 
> Gaston Kleiman wrote:
>     Have you tried setting `CommandInfo.shell` to `false` and adding 
> `--sleep_duration=0` to `CommandInfo.arguments`? If `CommandInfo.shell` is 
> set to `true`, `sh` will receive the SIGTERM signal and exit instead of 
> propagating it to the helper.
> 
> Qian Zhang wrote:
>     Yeah, I have already set it, here is the diff on top of this patch:
>     ```
>     -  v1::CommandInfo command;
>     -  command.set_shell(false);
>     +  v1::CommandInfo commandInfo;
>     +  commandInfo.set_shell(false);
>     +  commandInfo.set_value(getTestHelperPath("test-helper"));
>     +  commandInfo.add_arguments("test-helper");
>     +  commandInfo.add_arguments(KillPolicyTestHelper::NAME);
>     +  commandInfo.add_arguments("--sleep_duration=1");
>      
>        v1::TaskInfo taskInfo = v1::createTask(
>            agentId,
>            v1::Resources::parse("cpus:0.1;mem:32;disk:32").get(),
>     -      command);
>     -
>     -  // The "nginx:alpine" container returns an "EXIT_STATUS" of 0 on
>     -  // receiving a SIGTERM.
>     -  mesos::v1::Image image;
>     -  image.set_type(mesos::v1::Image::DOCKER);
>     -  image.mutable_docker()->set_name("nginx:alpine");
>     -
>     -  mesos::v1::ContainerInfo* container = taskInfo.mutable_container();
>     -  container->set_type(mesos::v1::ContainerInfo::MESOS);
>     -  container->mutable_mesos()->mutable_image()->CopyFrom(image);
>     +      commandInfo);
>     ```
>     
>     After debugging, I found the nested container status file does not exist:
>     ```
>     ls: cannot access 
> /tmp/MesosContainerizer_DefaultExecutorTest_ROOT_NoTransitionFromKillingToFinished_0_kztkSB/containers/c649ec7d-972c-4d4c-8c3a-14d6fdc44bce/containers/3cafd0f9-f8c7-49b9-8865-6c74f9483dc0/status:
>  No such file or directory
>     ```
>     But this file exists when the task is running, I am not sure why it gets 
> deleted during the test. If this file does not exist, the nested container 
> will be considered as termiated by SIGKILL (see 
> https://github.com/apache/mesos/blob/1.4.0/src/slave/containerizer/mesos/containerizer.cpp#L2600).
> 
> Qian Zhang wrote:
>     Found the root cause of this issue, receiving `TASK_RUNNING` does not 
> guarantee the task (`test-helper`) is fully started, so if we kill the task 
> right after receiving `TASK_RUNNING`, the task may not be ready yet (e.g., 
> the signal handler is not set), that's why `SIGTERM` was not handled as what 
> we expect. We need to make sure the task is fully started before killing it.

Good catch! Anand did something similar in https://reviews.apache.org/r/61575, 
but it hasn't been committed yet.


- Gaston


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/62775/#review187126
-----------------------------------------------------------


On Oct. 16, 2017, 1:36 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/62775/
> -----------------------------------------------------------
> 
> (Updated Oct. 16, 2017, 1:36 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-7975
>     https://issues.apache.org/jira/browse/MESOS-7975
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added a test `ROOT_NoTransitionFromKillingToFinished`.
> 
> 
> Diffs
> -----
> 
>   src/tests/default_executor_tests.cpp 
> 68312010a45df5dbdb6d9d4c49d1faa5d8c60472 
>   src/tests/kill_policy_test_helper.cpp 
> a1880595ff015475f1ba49437d49f7397da19422 
> 
> 
> Diff: https://reviews.apache.org/r/62775/diff/2/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Reply via email to