> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > RELEASE-NOTES.md, line 13
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553877#file1553877line13>
> >
> >     This should be elaborated a little bit more. It is confusing as Thermos 
> > processes have a `daemon` property.

Good catch, I have updated it to have more detail.


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > src/main/python/apache/thermos/common/process_util.py, lines 50-74
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553878#file1553878line50>
> >
> >     I would propose to wrap that entire method in a try-catch block so that 
> > this method is effectively no-throw. The advantages:
> >     
> >     * Thermos will still work on Kernels older than 3.4 (even though those 
> > would still be affected by the bug you have reported).
> >     * We minimize the risk of crashes due to bugs in this low level code.
> >     * We can provide proper logging here, as we have the necessary context 
> > to explain what is going on and that it is safe to continue.

I've thought about this a lot, but kernel 3.4 was released in 2012: 
https://kernelnewbies.org/Linux_3.4

I would be very surprised if folks are running Mesos on a kernel older than 
that. For reference, here are the kernels shipping with some of the oldest 
currently supported distros:
- Ubuntu 12.04 - Kernel 3.2 supported until April 2017
- Ubuntu 14.04 - Kernel 3.13 supported until April 2019
- CentOS/Red Hat 6 - Kernel Version 2.6.32. Released in 2010, supported until 
May 2017
- CentOS/Red Hat 7 - Kernel Version 3.10. Released in 2014, supported until 2020

Here are the support policies of related projects:
- Mesos (from https://mesos.apache.org/gettingstarted/ ): "For full support of 
process isolation under Linux a recent kernel >=3.10 is required."
- Docker (from https://docs.docker.com/engine/installation/binaries/ ): "A 3.10 
Linux kernel is the minimum requirement for Docker."

If you feel very strongly about this, please file a JIRA and we can discuss 
this in a follow-up review before 0.17.0 is released. I don't strongly oppose 
this, but it seems like we are bending over backwards to support a case that 
won't exist.


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > RELEASE-NOTES.md, line 14
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553877#file1553877line14>
> >
> >     We should try to make that optional. See below for details.

Dropped per my argument below.


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > src/main/python/apache/thermos/core/process.py, lines 288-297
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553880#file1553880line288>
> >
> >     As mentioned above, I would propose to do the error handling in 
> > `setup_child_subreaping` directly.
> 
> Joshua Cohen wrote:
>     I think this try/except is generally useful. It's something I've been 
> meaning to add for a while, as I ran into the scenario frequently while 
> working on filesystem isolation where the fork would fail and the task would 
> go lost.

Not doing this now, since I am not making the related error-handling change in 
`setup_child_subreaping`.

The try/except is needed to ensure we send up the expected exception so tasks 
fail with the right message and have the right logging.
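
For anyone following along, here is a minimal sketch of the pattern under 
discussion. It is not the code in the diff. It assumes Linux with a libc that 
exposes `prctl`, takes the value 36 for `PR_SET_CHILD_SUBREAPER` from 
`<linux/prctl.h>`, and uses hypothetical names (`enable_child_subreaper`, 
`SubreaperError`, and the `prctl` parameter, which exists only so the call can 
be stubbed):

```python
import ctypes
import ctypes.util
import os

# Value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h> (Linux 3.4 and later).
PR_SET_CHILD_SUBREAPER = 36


class SubreaperError(Exception):
    """Hypothetical error type for this sketch; not part of Thermos."""


def enable_child_subreaper(prctl=None):
    """Mark the calling process as a child subreaper, or raise SubreaperError.

    `prctl` is a hypothetical injection point so the syscall can be stubbed.
    """
    if prctl is None:
        # find_library may return None; CDLL(None) then falls back to the
        # symbols already loaded into the current process. On a platform
        # without prctl, the attribute lookup raises AttributeError.
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        prctl = libc.prctl
    # prctl(2) returns 0 on success and -1 on failure with errno set.
    if prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise SubreaperError(
            "prctl(PR_SET_CHILD_SUBREAPER) failed: %s" % os.strerror(err))
```

In this sketch the caller would invoke `enable_child_subreaper()` once before 
forking and catch `SubreaperError` around it, so that a failure surfaces as a 
task failure with a clear message rather than a lost task.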


> On Nov. 4, 2016, 9:03 a.m., Stephan Erb wrote:
> > src/test/sh/org/apache/aurora/e2e/test_daemonizing_process.aurora, line 38
> > <https://reviews.apache.org/r/53418/diff/4/?file=1553882#file1553882line38>
> >
> >     Please add a small comment here for future readers. For example: 
> > "Assert that Thermos does not lose track of double forking processes. On 
> > task teardown the daemonized process should receive a signal to shut down 
> > cleanly."

Done, good catch.


- Zameer


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53418/#review154913
-----------------------------------------------------------


On Nov. 3, 2016, 4:48 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53418/
> -----------------------------------------------------------
> 
> (Updated Nov. 3, 2016, 4:48 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Stephan Erb.
> 
> 
> Bugs: AURORA-1808
>     https://issues.apache.org/jira/browse/AURORA-1808
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> # Problem
> 
> Processes can daemonize and escape the supervision of a coordinator. Using 
> the Docker Containerizer or the Mesos Containerizer with pid isolation means 
> that the processes will be reparented to the sh process that launches 
> the executor. For example:
> 
> ```
> root@aurora:/# ps xf
>   PID TTY      STAT   TIME COMMAND
>    48 ?        Ss     0:00 /bin/bash
>    86 ?        R+     0:00  _ ps xf
>     1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex 
> --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config 
> /home/vagrant/aurora/examples/va
>     5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex 
> --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config 
> /home/vagrant/aurora/examples/vag
>    23 ?        S      0:00  _ /usr/local/bin/python2.7 
> /mnt/mesos/sandbox/thermos_runner.pex 
> --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152
>  --
>    29 ?        Ss     0:00      _ /usr/local/bin/python2.7 
> /mnt/mesos/sandbox/thermos_runner.pex 
> --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
>    32 ?        S      0:00      |   _ /bin/bash -c      while true; do       
> echo hello world       sleep 10     done
>    81 ?        S      0:00      |       _ sleep 10
>    31 ?        Ss     0:00      _ /usr/local/bin/python2.7 
> /mnt/mesos/sandbox/thermos_runner.pex 
> --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
>    33 ?        S      0:00          _ /bin/bash -c      while true; do       
> echo hello world       sleep 10     done
>    82 ?        S      0:00              _ sleep 10
>    47 ?        S      0:00 python ./daemon.py
> ```
> 
> # Solution
> 
> Ensure processes that escape the supervision of the coordinator reparent to 
> the runner, which can send signals to them on task teardown. We do this by 
> using the `PR_SET_CHILD_SUBREAPER` flag of `prctl(2)`.
> 
> After this change the process tree looks like:
> ```
> root@aurora:/# ps xf
>   PID TTY      STAT   TIME COMMAND
>    66 ?        Ss     0:00 /bin/bash
>    70 ?        R+     0:00  _ ps xf
>     1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex 
> --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config 
> /home/vagrant/aurora/examples/va
>     5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex 
> --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config 
> /home/vagrant/aurora/examples/vag
>    23 ?        S      0:00  _ /usr/local/bin/python2.7 
> /mnt/mesos/sandbox/thermos_runner.pex 
> --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849
>  --
>    33 ?        Ss     0:00      _ /usr/local/bin/python2.7 
> /mnt/mesos/sandbox/thermos_runner.pex 
> --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
>    40 ?        S      0:00      |   _ /bin/bash -c      while true; do       
> echo hello world       sleep 10     done
>    63 ?        S      0:00      |       _ sleep 10
>    36 ?        Ss     0:00      _ /usr/local/bin/python2.7 
> /mnt/mesos/sandbox/thermos_runner.pex 
> --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
>    37 ?        S      0:00      |   _ /bin/bash -c      while true; do       
> echo hello world       sleep 10     done
>    62 ?        S      0:00      |       _ sleep 10
>    55 ?        S      0:00      _ python ./daemon.py
> 
> ```
> 
> Now the runner is aware of the reparented processes and can tear them down 
> cleanly with a `SIGTERM`.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md d89ef2f641373ac229be693a21a2c0111e1f241a 
>   src/main/python/apache/thermos/common/process_util.py 
> abd2c0ef35858d13971319b0a7436ce2293824ce 
>   src/main/python/apache/thermos/core/helper.py 
> 68855e1e54ba1cd4456e18a36fb237ce6a468c34 
>   src/main/python/apache/thermos/core/process.py 
> 3ec43e2719ef97026f399c4b2aa23002559b3153 
>   src/main/python/apache/thermos/core/runner.py 
> 7b9013d11f6ff4172b6b7bf56e62299b0d11c977 
>   src/test/sh/org/apache/aurora/e2e/test_daemonizing_process.aurora 
> PRE-CREATION 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> 67702d2c0f2e18ee10dcb798b6d421050bd7d4ca 
> 
> Diff: https://reviews.apache.org/r/53418/diff/
> 
> 
> Testing
> -------
> 
> src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>
