[ https://issues.apache.org/jira/browse/AURORA-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reza Motamedi reassigned AURORA-1918:
-------------------------------------
Assignee: Reza Motamedi (was: David Robinson)
> allow resource monitoring to be disabled in the executor
> --------------------------------------------------------
>
> Key: AURORA-1918
> URL: https://issues.apache.org/jira/browse/AURORA-1918
> Project: Aurora
> Issue Type: Task
> Components: Executor
> Reporter: David Robinson
> Assignee: Reza Motamedi
>
> The Aurora executor monitors a [task's resource
> usage|https://github.com/apache/aurora/blob/cc2aa46f7ad8590e201621ffe2799299959ef7eb/src/main/python/apache/thermos/monitoring/resource.py#L15-L28]
> (CPU, memory and disk) and kills it [if its disk usage exceeds its
> reservation|https://github.com/apache/aurora/blob/cc2aa46f7ad8590e201621ffe2799299959ef7eb/src/main/python/apache/aurora/executor/common/resource_manager.py#L61-L67].
> Monitoring disk usage is expensive: the executor does the equivalent of
> running 'du' inside a container sandbox, recursively walking the sandbox to
> calculate usage and, in doing so, effectively trashing the page cache. Within
> Twitter we've seen the executor consume an entire core while calculating disk
> usage -- a container with 500k files can reproduce the problem.
> The executor also calculates process metrics, but the metrics are never used.
> Mesos has a [posix disk
> isolator|https://github.com/apache/mesos/blob/master/docs/mesos-containerizer.md]
> (and XFS isolator) which provides the same functionality: it monitors disk
> usage and terminates a task if it exceeds its reservation.
> Thermos Observer also monitors resource usage (see AURORA-1917), so disk
> usage is typically calculated 3 times -- once each by the executor, the
> observer, and Mesos.
> This could be solved by adding [--task_process_collection_interval_secs and
> --task_disk_collection_interval_secs
> flags|https://github.com/apache/aurora/commit/33acb899b8cbfd9914f028524cdd9428beeb06e3]
> to the executor, and disabling resource collection when a zero interval is
> specified.
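
The proposal above (a collection interval where zero means "disabled") could be sketched roughly as follows. This is only an illustration, not the executor's actual code: `sandbox_disk_usage` and `DiskCollector` are hypothetical names, and the walk sums apparent file sizes rather than allocated blocks as 'du' would. The `interval_secs` argument stands in for the value of the proposed --task_disk_collection_interval_secs flag.

```python
import os


def sandbox_disk_usage(root):
    """Recursively sum file sizes under root (hypothetical helper).

    This is the expensive part: every file in the sandbox is stat'ed.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished between listing and stat; skip it.
                pass
    return total


class DiskCollector:
    """Hypothetical collector: a zero (or negative) interval disables it."""

    def __init__(self, root, interval_secs):
        self.root = root
        self.interval_secs = interval_secs
        self.last_sample = None
        self._last_time = None

    @property
    def enabled(self):
        return self.interval_secs > 0

    def maybe_collect(self, now):
        """Walk the sandbox only if enabled and the interval has elapsed.

        Returns the most recent sample, or None when collection is disabled.
        """
        if not self.enabled:
            return None
        if self._last_time is not None and now - self._last_time < self.interval_secs:
            return self.last_sample
        self.last_sample = sandbox_disk_usage(self.root)
        self._last_time = now
        return self.last_sample
```

With an interval of 0 the sandbox is never walked, so disk enforcement would be left entirely to the Mesos isolator; with a positive interval the walk runs at most once per interval and callers see the cached sample in between.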