[jira] [Created] (AURORA-1918) allow resource monitoring to be disabled in the executor

David Robinson (JIRA) Mon, 10 Apr 2017 17:09:06 -0700

David Robinson created AURORA-1918:
--------------------------------------

             Summary: allow resource monitoring to be disabled in the executor
                 Key: AURORA-1918
                 URL: https://issues.apache.org/jira/browse/AURORA-1918
             Project: Aurora
          Issue Type: Task
          Components: Executor
            Reporter: David Robinson
            Assignee: David Robinson



The Aurora executor monitors a [task's resource 
usage|https://github.com/apache/aurora/blob/cc2aa46f7ad8590e201621ffe2799299959ef7eb/src/main/python/apache/thermos/monitoring/resource.py#L15-L28]
 (CPU, memory and disk) and kills it [if its disk usage exceeds its 
reservation|https://github.com/apache/aurora/blob/cc2aa46f7ad8590e201621ffe2799299959ef7eb/src/main/python/apache/aurora/executor/common/resource_manager.py#L61-L67].

Monitoring disk usage is expensive: the executor does the equivalent of running 
'du' inside a container sandbox, recursively walking the sandbox to calculate 
usage and in doing so effectively trashing the page cache. Within Twitter we've 
seen the executor consume an entire core while calculating disk usage -- a 
container with 500k files can reproduce the problem.
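
To illustrate why this is costly, here is a minimal sketch of a 'du'-style 
recursive walk. It is not the executor's actual code: the function name 
disk_usage_bytes is hypothetical, and it sums apparent file sizes rather than 
allocated blocks. The point is that every file under the sandbox gets stat'ed on 
each collection pass, so the cost grows with the file count.

```python
import os


def disk_usage_bytes(root):
    """Sum the apparent sizes of all files under root (hypothetical helper).

    Symlinks are not followed; each entry is lstat'ed once per call, so a
    sandbox with N files costs roughly N stat calls per collection pass.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                # The file may have been removed while we were walking.
                continue
            total += st.st_size
    return total
```

Running something like this on a fixed interval against a sandbox with hundreds 
of thousands of files repeats the full walk every time.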

The executor also calculates process metrics, but the metrics are never used.

Mesos has a [posix disk 
isolator|https://github.com/apache/mesos/blob/master/docs/mesos-containerizer.md]
 (and XFS isolator) which provides the same functionality: it monitors disk 
usage and terminates a task if it exceeds its reservation.

Thermos Observer also monitors resource usage (see AURORA-1917), so disk usage 
is typically calculated 3 times -- once each by the executor, the observer, and 
Mesos.

This could be solved by adding [--task_process_collection_interval_secs and 
--task_disk_collection_interval_secs 
flags|https://github.com/apache/aurora/commit/33acb899b8cbfd9914f028524cdd9428beeb06e3]
 to the executor, with a zero interval disabling the corresponding resource 
collection.
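
A rough sketch of how the zero-interval convention might look is below. It uses 
the standard library's argparse rather than the executor's real option handling, 
and the default intervals, the helper names (non_negative_secs, 
collection_enabled, plan_collectors) and the rejection of negative values are 
all assumptions made for illustration only.

```python
import argparse


def non_negative_secs(value):
    """Parse an interval in seconds; negative values are rejected (assumption)."""
    secs = float(value)
    if secs < 0:
        raise argparse.ArgumentTypeError("interval must be >= 0")
    return secs


def build_parser():
    parser = argparse.ArgumentParser(description="executor flag sketch")
    # Flag names come from the proposal; the defaults are placeholders.
    parser.add_argument("--task_process_collection_interval_secs",
                        type=non_negative_secs, default=20.0)
    parser.add_argument("--task_disk_collection_interval_secs",
                        type=non_negative_secs, default=60.0)
    return parser


def collection_enabled(interval_secs):
    """A zero interval means the collector is disabled."""
    return interval_secs > 0


def plan_collectors(argv):
    """Return which collectors would run for the given command line."""
    args = build_parser().parse_args(argv)
    return {
        "process": collection_enabled(args.task_process_collection_interval_secs),
        "disk": collection_enabled(args.task_disk_collection_interval_secs),
    }
```

For example, plan_collectors(["--task_disk_collection_interval_secs", "0"]) 
would leave process collection on and turn disk collection off, which would let 
the Mesos disk isolator be the only component walking the sandbox.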



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
