-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67627/
-----------------------------------------------------------

Review request for Aurora, Renan DelValle, Reza Motamedi, and Santhosh Kumar Shanmugham.


Repository: aurora


Description
-------

Add observer command line option `--disable_task_resource_collection` to disable the collection of CPU, memory, and disk metrics for observed tasks. This is useful in setups where metrics cannot be gathered reliably (e.g. when using PID namespaces) or where collection is expensive because of hundreds of active tasks per host.


Diffs
-----

  RELEASE-NOTES.md edc081f502370190597ad028f3275cdfd572f5ca
  docs/reference/observer-configuration.md c791b3480e5bf35e6eb0fbea908ff3242eab315d
  src/main/python/apache/aurora/config/BUILD 12e7fe973f456d0847ce63d3b293131a7f4c3bdd
  src/main/python/apache/aurora/tools/thermos_observer.py fd9465d2e2b3135f3fdf8230777117adaa89337c
  src/main/python/apache/thermos/monitoring/resource.py 72ed4e5a82dfd8a09e0a8262f6da4992ac98542a
  src/main/python/apache/thermos/observer/task_observer.py 94cd6c541bb7f8a4c153cc51caa63d2c08888a49
  src/test/python/apache/thermos/monitoring/test_resource.py 44450647a180f86903ebd37f2a9f4327496597e9


Diff: https://reviews.apache.org/r/67627/diff/1/


Testing
-------

We are running our Mesos agents with PID namespaces enabled (i.e. `--isolation='namespaces/ipc,namespaces/pid,...'`). Sometimes the hosts are also tightly packed with many small tasks (e.g. `~130` active tasks and `~1000` finished tasks).

Even with very relaxed scrape settings of `--task_process_collection_interval_secs=3000` and `--task_disk_collection_interval_secs=3000`, it can take between `150ms` and `2500ms` to render the observer landing page `/main`. This patch reduces this to about `100ms-150ms`.

There is no immediate downside, as metrics reporting is broken anyway due to the PID namespacing.


Thanks,

Stephan Erb
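
For readers who want a picture of what a flag like `--disable_task_resource_collection` might do, here is a minimal sketch. It is not taken from the diff: the class names, the `ResourceSample` type, and the use of `argparse` are assumptions made for illustration only, and the actual observer code may wire its options differently. The idea shown is simply that the flag swaps the expensive per-task monitor for one that reports zero usage.

```python
import argparse
from collections import namedtuple

# Hypothetical sample type; the real observer's data structures may differ.
ResourceSample = namedtuple("ResourceSample", ["cpu", "ram_bytes", "disk_bytes"])


class CollectingMonitor:
    """Stand-in for a monitor that would scan processes and disk usage."""

    def sample(self, task_id):
        # Placeholder values; a real monitor would do the expensive
        # per-task process and disk inspection here.
        return ResourceSample(cpu=0.5, ram_bytes=64 * 1024 * 1024, disk_bytes=1024)


class NullMonitor:
    """Performs no collection and always reports zero usage."""

    def sample(self, task_id):
        return ResourceSample(cpu=0.0, ram_bytes=0, disk_bytes=0)


def build_monitor(argv):
    """Pick a monitor based on the command line flag (argparse is an assumption)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disable_task_resource_collection",
        action="store_true",
        default=False,
        help="Skip CPU, memory, and disk collection for observed tasks.",
    )
    options = parser.parse_args(argv)
    if options.disable_task_resource_collection:
        return NullMonitor()
    return CollectingMonitor()
```

With this shape, the rest of the observer can keep calling `sample()` unchanged; only the choice of monitor depends on the flag.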
