-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53403/
-----------------------------------------------------------
Review request for Aurora, Joshua Cohen, Santhosh Kumar Shanmugham, and Stephan
Erb.
Bugs: AURORA-1808
https://issues.apache.org/jira/browse/AURORA-1808
Repository: aurora
Description
-------
This is a WIP patch showing a possible fix to AURORA-1808.
# Problem
Processes can daemonize and escape the supervision of a coordinator. Using the
Docker Containerizer or the Mesos Containerizer with pid isolation means that
such processes are reparented to the `sh` process that launches the
executor. For example:
````
root@aurora:/# ps xf
PID TTY STAT TIME COMMAND
48 ? Ss 0:00 /bin/bash
86 ? R+ 0:00 _ ps xf
1 ? Ss 0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex
--announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config
/home/vagrant/aurora/examples/va
5 ? Sl 0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex
--announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config
/home/vagrant/aurora/examples/vag
23 ? S 0:00 _ /usr/local/bin/python2.7
/mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152
--
29 ? Ss 0:00 _ /usr/local/bin/python2.7
/mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
32 ? S 0:00 | _ /bin/bash -c while true; do
echo hello world sleep 10 done
81 ? S 0:00 | _ sleep 10
31 ? Ss 0:00 _ /usr/local/bin/python2.7
/mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
33 ? S 0:00 _ /bin/bash -c while true; do
echo hello world sleep 10 done
82 ? S 0:00 _ sleep 10
47 ? S 0:00 python ./daemon.py
````
# Solution
Ensure processes that escape the supervision of the coordinator reparent to the
runner, which can send signals to them on task teardown.
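For readers unfamiliar with reparenting, one Linux mechanism that produces this behaviour is the child-subreaper flag set through `prctl(2)` (Linux 3.4+). Whether this patch uses exactly that call is an assumption, since the description does not say. The helper names below are hypothetical and not taken from the diff; this is only a minimal sketch using `ctypes`:

```python
import ctypes
import ctypes.util
import os

# Constants from <linux/prctl.h>.
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37


def _libc():
    # Load the C library with errno tracking enabled.
    return ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)


def _zero():
    return ctypes.c_ulong(0)


def set_child_subreaper(enable=True):
    """Hypothetical helper: make the calling process adopt orphaned descendants."""
    rc = _libc().prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(int(enable)),
                       _zero(), _zero(), _zero())
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def is_child_subreaper():
    """Hypothetical helper: report whether the subreaper flag is currently set."""
    flag = ctypes.c_int(0)
    rc = _libc().prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag),
                       _zero(), _zero(), _zero())
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return bool(flag.value)
```

A process that calls `set_child_subreaper()` before forking its children becomes the new parent of any descendant that daemonizes, instead of pid 1 in the container.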
After this change the process tree looks like:
````
root@aurora:/# ps xf
PID TTY STAT TIME COMMAND
66 ? Ss 0:00 /bin/bash
70 ? R+ 0:00 _ ps xf
1 ? Ss 0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex
--announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config
/home/vagrant/aurora/examples/va
5 ? Sl 0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex
--announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config
/home/vagrant/aurora/examples/vag
23 ? S 0:00 _ /usr/local/bin/python2.7
/mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849
--
33 ? Ss 0:00 _ /usr/local/bin/python2.7
/mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
40 ? S 0:00 | _ /bin/bash -c while true; do
echo hello world sleep 10 done
63 ? S 0:00 | _ sleep 10
36 ? Ss 0:00 _ /usr/local/bin/python2.7
/mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
37 ? S 0:00 | _ /bin/bash -c while true; do
echo hello world sleep 10 done
62 ? S 0:00 | _ sleep 10
55 ? S 0:00 _ python ./daemon.py
````
Now the runner is aware of the reparented processes and can tear them down
cleanly during task teardown.
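As an illustration of what teardown can look like once stray processes are children of the runner, the sketch below lists a process's direct children from `/proc` and signals them. It is not the code in this diff: `child_pids` and `signal_pids` are hypothetical names, and the real runner may locate and signal processes differently.

```python
import os
import signal


def child_pids(parent_pid):
    """Return PIDs whose parent is parent_pid by scanning /proc (Linux only)."""
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                stat = f.read()
        except (IOError, OSError):
            continue  # the process exited while we were scanning
        # The command name is parenthesised and may contain spaces,
        # so parse the fields that follow the last ')'.
        fields = stat[stat.rfind(")") + 2:].split()
        # fields[0] is the state, fields[1] is the parent PID.
        if int(fields[1]) == parent_pid:
            children.append(int(entry))
    return children


def signal_pids(pids, sig=signal.SIGTERM):
    """Send sig to each PID; return the PIDs that were still alive."""
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except OSError:
            pass  # already gone, or not ours to signal
    return signalled
```

A teardown path could then call `signal_pids(child_pids(os.getpid()))` and escalate to `SIGKILL` for anything that does not exit in time.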
Diffs
-----
src/main/python/apache/thermos/common/process_util.py
abd2c0ef35858d13971319b0a7436ce2293824ce
src/main/python/apache/thermos/core/helper.py
68855e1e54ba1cd4456e18a36fb237ce6a468c34
src/main/python/apache/thermos/core/process.py
3ec43e2719ef97026f399c4b2aa23002559b3153
src/main/python/apache/thermos/core/runner.py
7b9013d11f6ff4172b6b7bf56e62299b0d11c977
Diff: https://reviews.apache.org/r/53403/diff/
Testing
-------
No automated tests yet.
Validated behaviour with `ps` and `strace`.
Thanks,
Zameer Manji