> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote: > > RELEASE-NOTES.md, line 13 > > <https://reviews.apache.org/r/53418/diff/4/?file=1553877#file1553877line13> > > > > This should be elaborated a little bit more. It is confusing as Thermos > > processes have a `daemon` property.
Good catch, I have updated it to have more detail. > On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote: > > src/main/python/apache/thermos/common/process_util.py, lines 50-74 > > <https://reviews.apache.org/r/53418/diff/4/?file=1553878#file1553878line50> > > > > I would propose to wrap that entire method in a try-catch block so that > > this method is effectively no-throw. The advantages: > > > > * Thermos will still work on Kernels older than 3.4 (even though those > > would still be affected by the bug you have reported). > > * We minimize the risk of crashes due to bugs in this low level code. > > * We can provide propper logging here, as we have the necessary context > > to explain what is going on and that it is safe to continue. I've thought about this a lot, but kernel 3.4 was released in 2012: https://kernelnewbies.org/Linux_3.4 If folks are running a kernel older than that and running Mesos, I would be very surprised. For reference here are the kernels shipping with some very old currently supported distros: - Ubuntu 12.04 - Kernel 3.2 supported until April 2017 - Ubuntu 14.04 - Kernel 3.13 supported until April 2019 - CentOS/Red Hat 6 - Kernel Version 2.6.32. Released in 2010, supported until May 2017 - CentOS/Red Hat 7 - Kernel Version 3.10. Released in 2014, supported until 2020 Here are the support policies of related projects: - Mesos (from https://mesos.apache.org/gettingstarted/ ): "For full support of process isolation under Linux a recent kernel >=3.10 is required." - Docker (from https://docs.docker.com/engine/installation/binaries/ ): "A 3.10 Linux kernel is the minimum requirement for Docker." If you feel very strongly about this, please file a JIRA and we can discuss this in a follow up review before 0.17.0 is released. I don't strongly oppose this, but it seems like we are bending backwards to support a case that won't exist. > On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote: > > RELEASE-NOTES.md, line 14 > > <https://reviews.apache.org/r/53418/diff/4/?file=1553877#file1553877line14> > > > > We should try to make that optional. See below for details. Dropped per my argument below. > On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote: > > src/main/python/apache/thermos/core/process.py, lines 288-297 > > <https://reviews.apache.org/r/53418/diff/4/?file=1553880#file1553880line288> > > > > As mentioned above, I would propose to do the error handling in > > `setup_child_subreaping` directly. > > Joshua Cohen wrote: > I think this try/except is generally useful. It's something I've been > meaning to add for awhile, as I ran into the scenario frequenly while working > on filesystem isolation where the fork would fail and the task would go lost. Not doing this now per not making the related change for error handling in `setup_child_subreaping`. The try/accept is needed to ensure we send up the exepected exception so tasks fail with the right message and have the right logging. > On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote: > > src/test/sh/org/apache/aurora/e2e/test_daemonizing_process.aurora, line 38 > > <https://reviews.apache.org/r/53418/diff/4/?file=1553882#file1553882line38> > > > > Please add a small comment here for future readers. For example: > > "Assert that Thermos does not lose track of double forking processes. On > > task teardown the daemonized process should receive a signal to shut down > > cleanly." Done, good catch. - Zameer ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53418/#review154913 ----------------------------------------------------------- On Nov. 3, 2016, 4:48 p.m., Zameer Manji wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/53418/ > ----------------------------------------------------------- > > (Updated Nov. 3, 2016, 4:48 p.m.) > > > Review request for Aurora, Joshua Cohen and Stephan Erb. > > > Bugs: AURORA-1808 > https://issues.apache.org/jira/browse/AURORA-1808 > > > Repository: aurora > > > Description > ------- > > # Problem > > Processes can deamonize and escape the supervision of a coordinator. Using > the Docker Containerizer or the Mesos Containerizer with pid isolation means > that the processes will be come reparented to the sh process that launches > the executor. For example: > > ``` > root@aurora:/# ps xf > PID TTY STAT TIME COMMAND > 48 ? Ss 0:00 /bin/bash > 86 ? R+ 0:00 _ ps xf > 1 ? Ss 0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex > --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config > /home/vagrant/aurora/examples/va > 5 ? Sl 0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex > --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config > /home/vagrant/aurora/examples/vag > 23 ? S 0:00 _ /usr/local/bin/python2.7 > /mnt/mesos/sandbox/thermos_runner.pex > --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152 > -- > 29 ? Ss 0:00 _ /usr/local/bin/python2.7 > /mnt/mesos/sandbox/thermos_runner.pex > --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15 > 32 ? S 0:00 | _ /bin/bash -c while true; do > echo hello world sleep 10 done > 81 ? S 0:00 | _ sleep 10 > 31 ? Ss 0:00 _ /usr/local/bin/python2.7 > /mnt/mesos/sandbox/thermos_runner.pex > --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15 > 33 ? S 0:00 _ /bin/bash -c while true; do > echo hello world sleep 10 done > 82 ? S 0:00 _ sleep 10 > 47 ? S 0:00 python ./daemon.py > ``` > > # Solution > > Ensure processes that escape the supervision of the coordinator reparent to > the runner who can send signals to them on task tear down. We do this by > using the `PR_SET_CHILD_SUBREAPER` flag of `prctl(2)`. > > After this change the process tree looks like: > ``` > root@aurora:/# ps xf > PID TTY STAT TIME COMMAND > 66 ? Ss 0:00 /bin/bash > 70 ? R+ 0:00 _ ps xf > 1 ? Ss 0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex > --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config > /home/vagrant/aurora/examples/va > 5 ? Sl 0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex > --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config > /home/vagrant/aurora/examples/vag > 23 ? S 0:00 _ /usr/local/bin/python2.7 > /mnt/mesos/sandbox/thermos_runner.pex > --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849 > -- > 33 ? Ss 0:00 _ /usr/local/bin/python2.7 > /mnt/mesos/sandbox/thermos_runner.pex > --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84 > 40 ? S 0:00 | _ /bin/bash -c while true; do > echo hello world sleep 10 done > 63 ? S 0:00 | _ sleep 10 > 36 ? Ss 0:00 _ /usr/local/bin/python2.7 > /mnt/mesos/sandbox/thermos_runner.pex > --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84 > 37 ? S 0:00 | _ /bin/bash -c while true; do > echo hello world sleep 10 done > 62 ? S 0:00 | _ sleep 10 > 55 ? S 0:00 _ python ./daemon.py > > ``` > > Now the runner is aware of the reparented procesess can can tear it down > cleanly with a `SIGTERM`. > > > Diffs > ----- > > RELEASE-NOTES.md d89ef2f641373ac229be693a21a2c0111e1f241a > src/main/python/apache/thermos/common/process_util.py > abd2c0ef35858d13971319b0a7436ce2293824ce > src/main/python/apache/thermos/core/helper.py > 68855e1e54ba1cd4456e18a36fb237ce6a468c34 > src/main/python/apache/thermos/core/process.py > 3ec43e2719ef97026f399c4b2aa23002559b3153 > src/main/python/apache/thermos/core/runner.py > 7b9013d11f6ff4172b6b7bf56e62299b0d11c977 > src/test/sh/org/apache/aurora/e2e/test_daemonizing_process.aurora > PRE-CREATION > src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh > 67702d2c0f2e18ee10dcb798b6d421050bd7d4ca > > Diff: https://reviews.apache.org/r/53418/diff/ > > > Testing > ------- > > src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh > > > Thanks, > > Zameer Manji > >
