Re: Aurora, Thermos, PID 1, and You

Zameer Manji Wed, 02 Nov 2016 13:09:05 -0700

Filed a task https://issues.apache.org/jira/browse/AURORA-1808 to track
this work since there are no objections.


On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@apache.org> wrote:

> Resending this from my @apache.org email in case my previous email got
> caught in spam.
>
> On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@uber.com> wrote:
>
>> Hey,
>>
>> Recently I have experienced a number of issues in a production
>> environment with the DockerContainerizer, Aurora and Thermos. Although my
>> experience is specific to Docker, I believe this applies to anyone using
>> the Mesos Containerizer with PID isolation. The root cause of these issues
>> is the interaction between how we launch the executor and the role of
>> PID 1.
>>
>> The CommandInfo for the ExecutorInfo uses the default `shell` value which
>> is `true`[1]. This means that in any PID isolated container the `sh`
>> process that launches the executor will become PID 1. Here is an example
>> `ps` output from vagrant showing this:
>> ````
>> root@aurora:/# ps auxf
>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>> root       250  0.0  0.0  21928  2124 ?        Ss   01:19   0:00 /bin/bash
>> root       469  0.0  0.0  19176  1240 ?        R+   01:28   0:00  \_ ps auxf
>> root         1  0.0  0.0   4328   636 ?        Ss   01:10   0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer
>> root         5  0.7  1.4 1201128 45604 ?       Sl   01:10   0:08 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer-
>> root        23  0.1  0.6 115668 20764 ?        S    01:10   0:01  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
>> root        29  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
>> root        34  0.0  0.0  20040  1476 ?        S    01:10   0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10   done
>> root       468  0.0  0.0   4228   348 ?        S    01:28   0:00      |       \_ sleep 10
>> root        31  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
>> root        32  0.0  0.0  20040  1476 ?        S    01:10   0:00          \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
>> root       467  0.0  0.0   4228   352 ?        S    01:28   0:00              \_ sleep 10
>> root        47  0.0  0.0  24116  3052 ?        S    01:10   0:00 python ./daemon.py
>> ````
>>
>> This means processes that double fork/daemonize will be reparented to
>> `sh` and not our executor. You can see that the `python daemon.py` process
>> has been reparented to `sh` rather than the executor, so it is outside the
>> scope of the runners. This has a number of undesirable implications.
>> Perhaps most concerning is that processes that end up reparented to PID 1
>> will not receive SIGTERM or SIGKILL from thermos; instead they will be
>> killed by the kernel when thermos decides to exit. If anyone here runs
>> published images of popular software that double forks (like nginx), you
>> will never be able to ensure the processes die cleanly.
>>
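For anyone who wants to see the reparenting described above for themselves,
here is a minimal sketch in Python. It is only an illustration: the function
name is hypothetical and nothing here comes from Thermos. It double forks,
lets the middle process exit, and reports which PID adopted the grandchild.
In a PID-isolated container with no subreaper in between, I would expect
that PID to be 1.

```python
import os
import time


def adopted_parent_after_double_fork():
    """Double fork and return (middle_pid, pid that adopted the grandchild)."""
    read_fd, write_fd = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Middle process: remember our PID, fork the "daemon", then exit,
        # which orphans the grandchild.
        middle_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild: wait until the kernel has reparented us.
            while os.getppid() == middle_pid:
                time.sleep(0.01)
            os.write(write_fd, str(os.getppid()).encode())
        os._exit(0)  # reached by both the middle process and the grandchild
    os.close(write_fd)
    os.waitpid(middle, 0)  # reap the middle process
    adopted_by = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    return middle, adopted_by
```

The caller never learns the grandchild's PID through any normal channel,
which is why the runner in the `ps` output cannot signal `daemon.py`.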
>> I've been thinking about this problem for a while and upon advice from
>> others and my own research I believe the best solution is as follows:
>> 1. We have good reasons for setting `shell=True` when launching the
>> executor. I'm not comfortable changing this because I'm not sure of all of
>> the implications if we choose another method.
>> 2. The thermos runners end up forking off the target processes. I think
>> the runners should be responsible for all of the processes that are created
>> by the children.
>> 3. We can make the runners responsible for their grandchildren by using
>> `prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` attribute for each
>> runner. This means double forked processes will be reparented to the
>> runner and not PID 1.
>> 4. On task teardown, we make the runners send SIGTERM and SIGKILL to the
>> PIDs they recorded and any other children they have.
>> 5. Each runner would need to have a SIGCHLD handler to handle zombie
>> processes that are reparented to it.
>> 5. Each runner would need to have a SIGCHLD handler to handle zombie
>> processes that are reparented to it.
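Steps 3 and 5 could look roughly like the sketch below. This is not the
actual patch; it assumes Linux 3.4 or later and calls `prctl(2)` through
`ctypes`. The helper names (`become_subreaper`, `reap_children`,
`install_sigchld_reaper`) are hypothetical and do not exist in Thermos. The
constant 36 is the value of `PR_SET_CHILD_SUBREAPER` in `<linux/prctl.h>`.

```python
import ctypes
import errno
import os
import signal

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>, Linux 3.4+


def become_subreaper():
    """Mark the calling process as a child subreaper (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                  zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_children():
    """Collect every exited child without blocking; return {pid: status}."""
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:  # no children left at all
                break
            raise
        if pid == 0:  # children exist, but none has exited yet
            break
        reaped[pid] = status
    return reaped


def install_sigchld_reaper():
    """Reap zombies (including adopted orphans) whenever SIGCHLD arrives."""
    signal.signal(signal.SIGCHLD, lambda signum, frame: reap_children())
```

A runner would call `become_subreaper()` once at startup, before forking
anything, and then `install_sigchld_reaper()`. One caveat: `waitpid(-1, ...)`
also collects the runner's direct children, so a real runner would have to
reconcile this with however it already waits on the processes it launched.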
>>
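For step 4, one possible shape of the teardown is sketched below. It is an
assumption on my part that "any other children" would be discovered by
scanning `/proc` for processes whose parent is the runner; the real change
may track them differently. The names `child_pids` and `terminate` and the
`grace` parameter are hypothetical.

```python
import os
import signal
import time


def child_pids(parent=None):
    """Return PIDs whose parent is `parent` (default: this process)."""
    parent = os.getpid() if parent is None else parent
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry, "rb") as f:
                stat = f.read()
        except (IOError, OSError):  # process vanished while scanning
            continue
        # comm (field 2) may contain spaces or ')': split after the last ')'
        fields = stat.rsplit(b")", 1)[1].split()
        if int(fields[1]) == parent:  # fields[0] is state, fields[1] is ppid
            found.append(int(entry))
    return found


def _send(pid, sig):
    try:
        os.kill(pid, sig)
    except OSError:
        pass  # already gone


def _exited(pid):
    """True once `pid` has exited and been reaped (or is not our child)."""
    try:
        done, _ = os.waitpid(pid, os.WNOHANG)
    except OSError:
        return True
    return done == pid


def terminate(pids, grace=5.0):
    """SIGTERM each pid, wait up to `grace` seconds, SIGKILL the survivors.

    Returns the set of pids that had to be sent SIGKILL.
    """
    remaining = set(pids)
    for pid in remaining:
        _send(pid, signal.SIGTERM)
    deadline = time.time() + grace
    while remaining and time.time() < deadline:
        remaining = {pid for pid in remaining if not _exited(pid)}
        if remaining:
            time.sleep(0.05)
    for pid in remaining:
        _send(pid, signal.SIGKILL)
    for pid in remaining:
        try:
            os.waitpid(pid, 0)  # reap what we just killed
        except OSError:
            pass
    return remaining
```

With the subreaper attribute set, orphaned daemons show up as direct
children of the runner, so something like
`terminate(set(recorded_pids) | set(child_pids()))` would cover both the
recorded PIDs and the adopted ones (`recorded_pids` is again a placeholder).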
>> [1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L109-L135
>> [2]: http://man7.org/linux/man-pages/man2/prctl.2.html
>>
>> --
>> Zameer Manji
>>
>


-- 
Zameer Manji
