-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
-----------------------------------------------------------

Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
Santhosh Kumar Shanmugham, and Stephan Erb.


Repository: aurora


Description
-------

When disk isolation is enabled in a Mesos agent it calculates the disk usage 
for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
essentially repeating the work already done by the agent. In practice, we see 
that disk monitoring is one of the most expensive resource monitoring tasks. 
For instance, when there are deeply nested directories, the CPU utilization of 
the observer process can easily reach 1.5 CPUs. It would be ideal if we 
delegate the disk monitoring task to the agent and do it only once. With this 
approach, when disk collection has improved in the agent (for instance by 
implementing XFS isolation), we can simply benefit from it without any code 
change. Some more information about the problem is provided in AURORA-1918.

This patch that introduces `MesosDiskCollector` which queries the agent's API 
endpoint to lookup disk_used_bytes. Note that there is also resource monitoring 
in thermos executor. Currently, I left the disk collector there to use the `du` 
implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and 
`etc_mesos-slave/isolation` for testing. They can be left as is. I included 
them in this patch to show how this would work e2e.


Diffs
-----

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 
1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/aurora-executor.service 
5a1a9082ecd7b1367ec677d760a5c375b6db9076 
  src/main/python/apache/aurora/tools/thermos_observer.py 
dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 
65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 
52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
  src/main/python/apache/thermos/monitoring/resource.py 
f5e3849ca6682c6d4720698be869ca6b9f703b94 
  src/main/python/apache/thermos/observer/task_observer.py 
4bb5d239e81fe4659397f899760c0e8853e93786 
  
src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
 fe74bd1d36666ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 
8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 
362393bfd1facf3198e2d438d0596b16700b72b8 


Diff: https://reviews.apache.org/r/66103/diff/1/


Testing
-------

I added unit tests.
Tested in vagrant and it works as intenced.
I also built and deployed in our test enviroment. In order to measure imporoved 
performance I created jobs with nested folders and noticed reduction in CPU 
utilization of the Observer process, by at least 60%. (1.5 CPU cores to 0.4 CPU 
cores)


Thanks,

Reza Motamedi

Reply via email to