-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53403/
-----------------------------------------------------------

Review request for Aurora, Joshua Cohen, Santhosh Kumar Shanmugham, and Stephan Erb.


Bugs: AURORA-1808
    https://issues.apache.org/jira/browse/AURORA-1808


Repository: aurora


Description
-------

This is a WIP patch showing a possible fix to AURORA-1808.

# Problem
Processes can daemonize and escape the supervision of a coordinator. With the
Docker Containerizer, or the Mesos Containerizer with pid isolation, such
processes are reparented to the `sh` process that launches the executor.

For example:
````
root@aurora:/# ps xf
  PID TTY      STAT   TIME COMMAND
   48 ?        Ss     0:00 /bin/bash
   86 ?        R+     0:00  \_ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152 --
   29 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   32 ?        S      0:00      |   \_ /bin/bash -c while true; do echo hello world sleep 10 done
   81 ?        S      0:00      |       \_ sleep 10
   31 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   33 ?        S      0:00          \_ /bin/bash -c while true; do echo hello world sleep 10 done
   82 ?        S      0:00              \_ sleep 10
   47 ?        S      0:00 python ./daemon.py
````

Here `python ./daemon.py` (PID 47) sits outside the runner's process tree.

# Solution
Ensure that processes which escape the supervision of the coordinator are
reparented to the runner, which can send signals to them on task teardown.

After this change the process tree looks like:
````
root@aurora:/# ps xf
  PID TTY      STAT   TIME COMMAND
   66 ?        Ss     0:00 /bin/bash
   70 ?        R+     0:00  \_ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849 --
   33 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   40 ?        S      0:00      |   \_ /bin/bash -c while true; do echo hello world sleep 10 done
   63 ?        S      0:00      |       \_ sleep 10
   36 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   37 ?        S      0:00      |   \_ /bin/bash -c while true; do echo hello world sleep 10 done
   62 ?        S      0:00      |       \_ sleep 10
   55 ?        S      0:00      \_ python ./daemon.py
````

Now the runner is aware of the reparented processes and can terminate them
cleanly during teardown.


Diffs
-----

  src/main/python/apache/thermos/common/process_util.py abd2c0ef35858d13971319b0a7436ce2293824ce
  src/main/python/apache/thermos/core/helper.py 68855e1e54ba1cd4456e18a36fb237ce6a468c34
  src/main/python/apache/thermos/core/process.py 3ec43e2719ef97026f399c4b2aa23002559b3153
  src/main/python/apache/thermos/core/runner.py 7b9013d11f6ff4172b6b7bf56e62299b0d11c977

Diff: https://reviews.apache.org/r/53403/diff/


Testing
-------

No automated tests yet. Validated behaviour with `ps` and `strace`.


Thanks,

Zameer Manji
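
Note on the Solution section: the description does not say which mechanism makes orphans reparent to the runner. One Linux facility that produces this behaviour is the "child subreaper" flag set through `prctl(2)`; it is only an assumption that the patch uses it. The following is a minimal sketch under that assumption, not the code in the diff. The helper names `become_subreaper` and `adopted_orphan_parent` are hypothetical and do not come from Thermos.

```python
import ctypes
import ctypes.util
import os
import time

# Value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h> (Linux 3.4 and later).
PR_SET_CHILD_SUBREAPER = 36


def become_subreaper():
    """Mark the calling process as a child subreaper (Linux only).

    Orphaned descendants are then reparented to this process instead of
    to PID 1 of the pid namespace.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def adopted_orphan_parent():
    """Double-fork a short-lived 'daemon' and return the parent PID it ends up with."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # Intermediate process: fork the daemon, then exit to orphan it.
        os.close(read_fd)
        intermediate_pid = os.getpid()
        if os.fork() == 0:
            # Daemon: wait until the intermediate parent is gone, then
            # report which process adopted us.
            while os.getppid() == intermediate_pid:
                time.sleep(0.01)
            os.write(write_fd, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)

    # Original process: reap the intermediate child, then read the report.
    os.close(write_fd)
    os.waitpid(child, 0)
    new_parent = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    try:
        # If we adopted the daemon, it is now our child and must be reaped.
        os.waitpid(-1, 0)
    except ChildProcessError:
        pass  # The daemon was adopted by some other process.
    return new_parent


if __name__ == "__main__":
    become_subreaper()
    print("own pid:", os.getpid())
    print("daemon's parent after double fork:", adopted_orphan_parent())
```

Once a process has set the flag, a daemonized descendant appears as one of its own children, so it can be signalled and waited on like any other child. That matches the second `ps xf` listing, where `python ./daemon.py` is shown under the runner.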