-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53403/#review154671
-----------------------------------------------------------



Overall this looks good to me.

I have no idea how to test this in practice other than to set up an e2e test
that starts up Mesos w/ pid isolation enabled, starts a job that double forks,
and then ensures everything is torn down properly.
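
Short of a full e2e test, the reparenting itself could probably be
sanity-checked locally. A rough sketch, assuming Linux 3.4+ and calling
prctl through ctypes; the helper names (set_child_subreaper,
double_fork_sleeper, parent_pid) are made up for illustration and are not
taken from the patch:

```python
import ctypes
import os
import sys
import time

# Option value from <linux/prctl.h>; available since Linux 3.4.
PR_SET_CHILD_SUBREAPER = 36


def set_child_subreaper():
    """Mark the calling process as a child subreaper (hypothetical helper).

    Returns True on success, False on failure or on non-Linux platforms.
    """
    if not sys.platform.startswith("linux"):
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    return libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), zero, zero, zero) == 0


def double_fork_sleeper(seconds):
    """Double-fork a sleeping grandchild and return its pid (hypothetical helper)."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # Intermediate child: fork the "daemon", report its pid, exit at once.
        os.close(read_fd)
        grandchild = os.fork()
        if grandchild == 0:
            os.close(write_fd)
            time.sleep(seconds)
            os._exit(0)
        os.write(write_fd, str(grandchild).encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        daemon_pid = int(pipe.read())
    # Reap the intermediate child; the grandchild is orphaned at this point.
    os.waitpid(child, 0)
    return daemon_pid


def parent_pid(pid):
    """Read a process's parent pid from /proc/<pid>/stat (Linux only)."""
    with open("/proc/%d/stat" % pid) as stat:
        # After the parenthesised command name the fields are: state, ppid, ...
        return int(stat.read().rsplit(")", 1)[1].split()[1])
```

The idea would be to call set_child_subreaper() first, then check that
parent_pid(double_fork_sleeper(30)) equals os.getpid(), and finally kill
and waitpid() the grandchild. That only covers the reparenting step, not
the teardown path under Mesos.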


src/main/python/apache/thermos/common/process_util.py (line 72)
<https://reviews.apache.org/r/53403/#comment224299>

    Prefix this log statement so we have context.



src/main/python/apache/thermos/core/helper.py (lines 236 - 240)
<https://reviews.apache.org/r/53403/#comment224298>

    Something like:
    
        orphaned_pids = coordinator_pids - {c.pid for c in psutil.Process().children()}
    
    is maybe more pythonic (assuming I got my syntax right anyway...)



src/main/python/apache/thermos/core/helper.py (line 242)
<https://reviews.apache.org/r/53403/#comment224297>

    I think this would probably make sense as an info log.


- Joshua Cohen


On Nov. 2, 2016, 10:16 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53403/
> -----------------------------------------------------------
> 
> (Updated Nov. 2, 2016, 10:16 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1808
>     https://issues.apache.org/jira/browse/AURORA-1808
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is a WIP patch showing a possible fix to AURORA-1808.
> 
> # Problem
> 
> Processes can daemonize and escape the supervision of a coordinator. Using 
> the Docker Containerizer or the Mesos Containerizer with pid isolation means 
> that the processes will be reparented to the `sh` process that launches 
> the executor. For example:
> ````
> root@aurora:/# ps xf
>   PID TTY      STAT   TIME COMMAND
>    48 ?        Ss     0:00 /bin/bash
>    86 ?        R+     0:00  \_ ps xf
>     1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
>     5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
>    23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152 --
>    29 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
>    32 ?        S      0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
>    81 ?        S      0:00      |       \_ sleep 10
>    31 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
>    33 ?        S      0:00          \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
>    82 ?        S      0:00              \_ sleep 10
>    47 ?        S      0:00 python ./daemon.py
> ````
> 
> # Solution
> Ensure processes that escape the supervision of the coordinator reparent to 
> the runner, which can send signals to them on task teardown. We do this by 
> using the `PR_SET_CHILD_SUBREAPER` option of `prctl(2)`.
> 
> After this change the process tree looks like:
> ````
> root@aurora:/# ps xf
>   PID TTY      STAT   TIME COMMAND
>    66 ?        Ss     0:00 /bin/bash
>    70 ?        R+     0:00  \_ ps xf
>     1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
>     5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
>    23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849 --
>    33 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
>    40 ?        S      0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
>    63 ?        S      0:00      |       \_ sleep 10
>    36 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
>    37 ?        S      0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
>    62 ?        S      0:00      |       \_ sleep 10
>    55 ?        S      0:00      \_ python ./daemon.py
> ````
> 
> Now the runner is aware of the reparented processes and can tear them down 
> cleanly during teardown.
> 
> Note that the man page for `prctl(2)` says that the processes that set 
> `PR_SET_CHILD_SUBREAPER` should reap children to get rid of zombies. It is 
> important to note that the runner already does this in its run loop via 
> `TaskRunnerHelper.reap_children()`. This patch has the side effect of 
> ensuring it will reap all of the children launched via coordinators.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/thermos/common/process_util.py 
> abd2c0ef35858d13971319b0a7436ce2293824ce 
>   src/main/python/apache/thermos/core/helper.py 
> 68855e1e54ba1cd4456e18a36fb237ce6a468c34 
>   src/main/python/apache/thermos/core/process.py 
> 3ec43e2719ef97026f399c4b2aa23002559b3153 
>   src/main/python/apache/thermos/core/runner.py 
> 7b9013d11f6ff4172b6b7bf56e62299b0d11c977 
> 
> Diff: https://reviews.apache.org/r/53403/diff/
> 
> 
> Testing
> -------
> 
> No automated tests yet.
> 
> Validated behaviour with `ps` and `strace`.
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>
