[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999352#comment-15999352
 ] 

Tomasz Janiszewski commented on MESOS-6933:
-------------------------------------------

This error is quite easy to reproduce.

1. Run a Mesos cluster with the default configuration (you can use 
{{./build/bin/mesos-local.sh}}). Do not enable any isolators, especially the 
namespace/pid isolator, because it can mask this bug.
2. Create a script that runs in an infinite loop and ignores signals:

{code}
cat > /tmp/script.sh <<EOF
#!/bin/sh
trap "echo SIGNAL" HUP INT TERM
while : ; do
  date >> /tmp/date.txt
  sleep 1
done
EOF
chmod +x /tmp/script.sh
{code}

3. Start the script on Mesos and kill it after it has run for a few seconds. 
You can use any framework, e.g. {{mesos-execute --kill_after=10secs 
--master=localhost:5050 --command="/tmp/script.sh" 
--name="graceful-kill-test"}}.
4. Monitor the logs. They show that the script is signaled with SIGTERM and the 
shell has exited, but the script is still running and producing output.


The easiest solution would be to signal the process tree and then wait for all 
processes in the tree to exit, not only the root.
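
The approach above can be sketched in shell. This is only an illustration 
under stated assumptions, not the executor's actual C++ code: {{list_tree}} and 
{{kill_tree_gracefully}} are hypothetical helper names, and {{pgrep -P}} (from 
procps) is assumed to be available for finding child processes.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; this is not Mesos code.

# Print the pid given as $1 followed by all of its descendants, one per line.
# Assumes `pgrep -P <pid>` is available to list direct children.
list_tree() {
  echo "$1"
  for child in $(pgrep -P "$1"); do
    list_tree "$child"
  done
}

# Send SIGTERM to every process in the tree rooted at $1, then wait up to
# $2 seconds for ALL of them (not only the root) to exit. Processes that
# are still alive after the grace period are sent SIGKILL.
# Returns 0 if the whole tree exited in time, 1 if escalation was needed.
kill_tree_gracefully() {
  pids=$(list_tree "$1")
  # $pids is intentionally unquoted so that each pid becomes one argument.
  kill -TERM $pids 2>/dev/null || true

  remaining=$2
  while [ "$remaining" -gt 0 ]; do
    alive=""
    for pid in $pids; do
      if kill -0 "$pid" 2>/dev/null; then
        alive="$alive $pid"
      fi
    done
    if [ -z "$alive" ]; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done

  # Grace period expired: escalate to SIGKILL for whatever is left.
  for pid in $pids; do
    kill -KILL "$pid" 2>/dev/null || true
  done
  return 1
}
```

For example, {{kill_tree_gracefully 18 10}} would signal pid 18 and its 
descendants and give all of them 10 seconds before escalating. The sketch 
misses processes forked after the tree has been listed, and {{kill -0}} still 
reports a zombie that has not been reaped yet as alive.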

> Executor does not respect grace period
> --------------------------------------
>
>                 Key: MESOS-6933
>                 URL: https://issues.apache.org/jira/browse/MESOS-6933
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor
>            Reporter: Tomasz Janiszewski
>
> The Mesos Command Executor tries to support a grace period with escalation, but 
> unfortunately it does not work. It launches {{command}} by wrapping it in 
> {{sh -c}}, which causes the process tree to look like this:
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to exit immediately, and the executor with it, while the 
> wrapped {{command}} might need some more time to finish. As a result, the 
> executor thinks the command terminated gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This causes leaks when the POSIX containerizer is used, because if the command 
> ignores SIGTERM it will be reparented to init and never get killed. Using 
> pid/namespace isolation only masks the problem, because the hanging process is 
> captured before it can shut down gracefully.
> The fix for this is to send SIGTERM only to the children of {{sh}}. {{sh}} will 
> exit when all child processes finish. If they do not, they will be killed by 
> escalation to SIGKILL.
> All versions since 0.20 are affected.
> This test should pass 
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list 
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
