Since there were no objections, I filed https://issues.apache.org/jira/browse/AURORA-1808 to track this work.

On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@apache.org> wrote:

> Resending this from my @apache.org email in case my previous email got
> caught in spam.
>
> On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@uber.com> wrote:
>
>> Hey,
>>
>> Recently I have experienced a number of issues in a production
>> environment with the DockerContainerizer, Aurora and Thermos. Although my
>> experience is specific to Docker, I believe this applies to anyone using
>> the Mesos Containerizer with pid isolation. The root cause of these issues
>> lies in the interaction between how we launch the executor and the role
>> of PID 1.
>>
>> The CommandInfo for the ExecutorInfo uses the default `shell` value, which
>> is `true`[1]. This means that in any PID isolated container the `sh`
>> process that launches the executor will become PID 1. Here is an example
>> `ps` output from vagrant showing this:
>> ````
>> root@aurora:/# ps auxf
>> USER  PID %CPU %MEM     VSZ   RSS TTY STAT START TIME COMMAND
>> root  250  0.0  0.0   21928  2124 ?   Ss   01:19 0:00 /bin/bash
>> root  469  0.0  0.0   19176  1240 ?   R+   01:28 0:00  \_ ps auxf
>> root    1  0.0  0.0    4328   636 ?   Ss   01:10 0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer
>> root    5  0.7  1.4 1201128 45604 ?   Sl   01:10 0:08 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer-
>> root   23  0.1  0.6  115668 20764 ?   S    01:10 0:01  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
>> root   29  0.0  0.5  113476 17936 ?   Ss   01:10 0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
>> root   34  0.0  0.0   20040  1476 ?   S    01:10 0:00      |   \_ /bin/bash -c while true; do echo hello world sleep 10 done
>> root  468  0.0  0.0    4228   348 ?   S    01:28 0:00      |       \_ sleep 10
>> root   31  0.0  0.5  113476 17936 ?   Ss   01:10 0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
>> root   32  0.0  0.0   20040  1476 ?   S    01:10 0:00          \_ /bin/bash -c while true; do echo hello world sleep 10 done
>> root  467  0.0  0.0    4228   352 ?   S    01:28 0:00              \_ sleep 10
>> root   47  0.0  0.0   24116  3052 ?   S    01:10 0:00 python ./daemon.py
>> ````
>>
>> This means processes that double fork/daemonize will be reparented to
>> `sh` and not our executor. You can see that the `python ./daemon.py`
>> process has been reparented to `sh` rather than the executor and is
>> outside the scope of the runners. This has a number of undesirable
>> implications. Perhaps the most concerning is that processes that end up
>> reparented to PID 1 will not receive SIGTERM or SIGKILL from thermos but
>> instead will be killed by the kernel when thermos decides to exit. If
>> anyone here runs published images that use popular software that double
>> forks (like nginx), they will never be able to ensure the processes die
>> cleanly.
>>
>> I've been thinking about this problem for a while, and based on advice
>> from others and my own research I believe the best solution is as follows:
>> 1. We have good reasons for setting `shell=True` when launching the
>> executor. I'm not comfortable changing this because I'm not sure of all of
>> the implications if we choose another method.
>> 2. The thermos runners end up forking off the target processes. I think
>> the runners should be responsible for all of the processes that are created
>> by the children.
>> 3. We can make the runners responsible for their grandchildren by using
>> `prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` attribute for each
>> runner. This means double forked processes will be reparented to the
>> runner and not PID 1.
>> 4. On task teardown, we make the runners send SIGTERM and SIGKILL to the
>> PIDs they recorded and any other children they have.
>> 5. Each runner would need to have a SIGCHLD handler to handle zombie
>> processes that are reparented to it.
>>
>> [1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L109-L135
>> [2]: http://man7.org/linux/man-pages/man2/prctl.2.html
>>
>> --
>> Zameer Manji
>>
>

--
Zameer Manji
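
To make steps 3 and 4 of the proposal concrete, here is a rough Python sketch of the mechanism. It is not Thermos code: `become_subreaper`, `spawn_double_forked_sleeper`, `parent_pid` and `terminate` are hypothetical helper names made up for illustration, the constant 36 is the `PR_SET_CHILD_SUBREAPER` value from `<linux/prctl.h>`, and the whole thing assumes Linux 3.4 or later with `/proc` mounted.

```python
import ctypes
import ctypes.util
import os
import signal
import time

# Value from <linux/prctl.h>; the option exists since Linux 3.4.
PR_SET_CHILD_SUBREAPER = 36


def become_subreaper():
    """Mark the calling process as a child subreaper (Linux only)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6",
                       use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1),
                  zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_double_forked_sleeper(seconds):
    """Double fork a sleeping process and return the grandchild's PID."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        os.close(read_fd)
        grandchild = os.fork()
        if grandchild == 0:
            try:
                os.close(write_fd)
                # Make sure SIGTERM has its default (terminate) action.
                signal.signal(signal.SIGTERM, signal.SIG_DFL)
                time.sleep(seconds)
            finally:
                os._exit(0)
        # Intermediate process: report the grandchild PID, then exit,
        # which orphans the grandchild.
        os.write(write_fd, str(grandchild).encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        grandchild_pid = int(reader.read())
    os.waitpid(child, 0)  # reap the intermediate process
    return grandchild_pid


def parent_pid(pid):
    """Read the parent PID of `pid` from /proc/<pid>/stat."""
    with open("/proc/%d/stat" % pid) as stat:
        # Fields after the "(comm)" entry are: state, ppid, ...
        fields = stat.read().rsplit(")", 1)[1].split()
    return int(fields[1])


def terminate(pid, grace_seconds=2.0):
    """Send SIGTERM, escalate to SIGKILL after a grace period, and reap."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return status
        time.sleep(0.05)
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    become_subreaper()
    daemon = spawn_double_forked_sleeper(30)
    # The orphaned grandchild is adopted by us instead of PID 1.
    print("daemon parent is us:", parent_pid(daemon) == os.getpid())
    status = terminate(daemon)
    print("daemon killed by signal:", os.WIFSIGNALED(status))
```

Because the orphan is reparented to the subreaper, it becomes an ordinary child of that process, so the runner can both signal it and `waitpid` on it. The sketch reaps inside `terminate` for simplicity; for step 5, a runner could instead loop over `os.waitpid(-1, os.WNOHANG)` from a SIGCHLD handler, as long as only one code path is responsible for reaping.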