Repository: aurora
Updated Branches:
  refs/heads/master 059b08621 -> 0c90c862a
Add MEDIAN_TIME_TO_STARTING as a new metric.

A new MTTS (Median Time To Starting) metric is added to the sla module in addition to MTTA and MTTR.

This review request is related to my previous review request: https://reviews.apache.org/r/51536

In the new implementation, the executor starts health checking at STARTING; if a health check succeeds before initial_interval_sec expires, the task transitions into the RUNNING state. Therefore, MTTS gives us an idea of how long it takes for a task to become active, whereas the difference between MTTR and MTTS represents the warm-up period for a task.

See the following issues for more background:
https://issues.apache.org/jira/browse/AURORA-1221
https://issues.apache.org/jira/browse/AURORA-1222

The new metric represents the median time spent waiting for a set of tasks to reach STARTING status within a time frame (including tasks that go on to reach RUNNING state within the time frame). Here I regard STARTING as an active state. However, STARTING state is account for platform and job uptime calculations.
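The computation described above can be sketched roughly as follows. This is an illustrative stand-alone sketch, not Aurora's MedianAlgorithm: the class MttsSketch, its Status enum, the Event type and the helpers at/task/medianTimeTo are all hypothetical names. Two behaviours are assumptions inferred from the test expectations in this commit: for an even number of samples the lower of the two middle values is returned, and an empty sample set yields 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; every name here is hypothetical, not Aurora's API.
final class MttsSketch {
  // Stand-in for the task states named in the commit message.
  enum Status { PENDING, ASSIGNED, STARTING, RUNNING, KILLED }

  // One state transition: the time (ms) at which a task entered a status.
  static final class Event {
    final long timestampMs;
    final Status status;

    Event(long timestampMs, Status status) {
      this.timestampMs = timestampMs;
      this.status = status;
    }
  }

  static Event at(long timestampMs, Status status) {
    return new Event(timestampMs, status);
  }

  static List<Event> task(Event... events) {
    return Arrays.asList(events);
  }

  // Median wait from PENDING to the target status, over tasks whose latest
  // state is not terminal and whose target event falls in [frameStart, frameEnd).
  static long medianTimeTo(
      Status target, List<List<Event>> tasks, long frameStart, long frameEnd) {

    List<Long> waits = new ArrayList<>();
    for (List<Event> events : tasks) {
      // Assumption: KILLED is the only terminal state modelled in this sketch.
      if (events.isEmpty() || events.get(events.size() - 1).status == Status.KILLED) {
        continue;
      }
      long pendingTs = -1;
      for (Event e : events) {
        if (e.status == Status.PENDING) {
          pendingTs = e.timestampMs;
        } else if (e.status == target && pendingTs >= 0
            && e.timestampMs >= frameStart && e.timestampMs < frameEnd) {
          waits.add(e.timestampMs - pendingTs);
          break;
        }
      }
    }
    if (waits.isEmpty()) {
      return 0L; // Assumption: no samples reports as zero.
    }
    Collections.sort(waits);
    // Assumption: lower median for an even number of samples.
    return waits.get((waits.size() - 1) / 2);
  }

  public static void main(String[] args) {
    List<List<Event>> tasks = new ArrayList<>();
    tasks.add(task(at(50, Status.PENDING)));
    tasks.add(task(at(50, Status.PENDING), at(100, Status.ASSIGNED), at(150, Status.STARTING)));
    tasks.add(task(at(100, Status.PENDING), at(200, Status.ASSIGNED),
        at(300, Status.STARTING), at(400, Status.RUNNING)));
    tasks.add(task(at(100, Status.PENDING), at(200, Status.ASSIGNED),
        at(300, Status.STARTING), at(400, Status.KILLED)));

    long mtts = medianTimeTo(Status.STARTING, tasks, 0, 500);
    long mttr = medianTimeTo(Status.RUNNING, tasks, 0, 500);
    // Prints: mtts=100 mttr=300 warmup=200
    System.out.println("mtts=" + mtts + " mttr=" + mttr + " warmup=" + (mttr - mtts));
  }
}
```

Note that the two medians are taken over different sets of tasks (only one task here has reached RUNNING), so the MTTR minus MTTS difference is a rough indicator of warm-up time rather than a per-task measurement.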
Testing Done:
./gradlew build
./gradlew :test
./build-support/jenkins/build.sh

Reviewed at https://reviews.apache.org/r/51580/

Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/0c90c862
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/0c90c862
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/0c90c862

Branch: refs/heads/master
Commit: 0c90c862a14c3a5efe0fdf0f30ee41c01b96b434
Parents: 059b086
Author: Kai Huang <[email protected]>
Authored: Tue Sep 6 12:26:13 2016 -0700
Committer: Zameer Manji <[email protected]>
Committed: Tue Sep 6 12:26:13 2016 -0700

----------------------------------------------------------------------
 docs/features/sla-metrics.md                    | 41 +++++++++++++-
 .../aurora/scheduler/sla/MetricCalculator.java  |  2 +
 .../aurora/scheduler/sla/SlaAlgorithm.java      |  2 +
 .../aurora/scheduler/sla/SlaAlgorithmTest.java  | 57 ++++++++++++++++++++
 4 files changed, 100 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/docs/features/sla-metrics.md
----------------------------------------------------------------------
diff --git a/docs/features/sla-metrics.md b/docs/features/sla-metrics.md
index 932b5dc..bca2ebf 100644
--- a/docs/features/sla-metrics.md
+++ b/docs/features/sla-metrics.md
@@ -6,6 +6,7 @@ Aurora SLA Measurement
   - [Platform Uptime](#platform-uptime)
   - [Job Uptime](#job-uptime)
   - [Median Time To Assigned (MTTA)](#median-time-to-assigned-\(mtta\))
+  - [Median Time To Starting (MTTS)](#median-time-to-starting-\(mtts\))
   - [Median Time To Running (MTTR)](#median-time-to-running-\(mttr\))
 - [Limitations](#limitations)
 
@@ -109,7 +110,7 @@ metric that helps track the dependency of scheduling performance on the requeste
 * Per job - `sla_<job_key>_mtta_ms`
 * Per cluster - `sla_cluster_mtta_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
-[ResourceAggregates.java](../../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java)
+[ResourceBag.java](../../src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtta_ms`
     * `sla_cpu_medium_mtta_ms`
@@ -135,6 +136,42 @@ MTTA only considers instances that have already reached ASSIGNED state and ignor
 that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource
 constraints) do not affect metric curves.
 
+### Median Time To Starting (MTTS)
+
+*Median time a job waits for its tasks to reach STARTING state. This is a comprehensive metric
+reflecting on the overall time it takes for the Aurora/Mesos to start initializing the sandbox
+for a task.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mtts_ms`
+* Per cluster - `sla_cluster_mtts_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
+[ResourceBag.java](../../src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mtts_ms`
+    * `sla_cpu_medium_mtts_ms`
+    * `sla_cpu_large_mtts_ms`
+    * `sla_cpu_xlarge_mtts_ms`
+    * `sla_cpu_xxlarge_mtts_ms`
+  * By RAM:
+    * `sla_ram_small_mtts_ms`
+    * `sla_ram_medium_mtts_ms`
+    * `sla_ram_large_mtts_ms`
+    * `sla_ram_xlarge_mtts_ms`
+    * `sla_ram_xxlarge_mtts_ms`
+  * By DISK:
+    * `sla_disk_small_mtts_ms`
+    * `sla_disk_medium_mtts_ms`
+    * `sla_disk_large_mtts_ms`
+    * `sla_disk_xlarge_mtts_ms`
+    * `sla_disk_xxlarge_mtts_ms`
+
+**Units:** milliseconds
+
+MTTS only considers instances in STARTING state. This ensures straggler instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+
 ### Median Time To Running (MTTR)
 
 *Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric
@@ -145,7 +182,7 @@ reflecting on the overall time it takes for the Aurora/Mesos to start executing
 * Per job - `sla_<job_key>_mttr_ms`
 * Per cluster - `sla_cluster_mttr_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
-[ResourceAggregates.java](../../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java)
+[ResourceBag.java](../../src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mttr_ms`
     * `sla_cpu_medium_mttr_ms`

http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
----------------------------------------------------------------------
diff --git a/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java b/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
index 3ddac8b..9a56cda 100644
--- a/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
+++ b/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
@@ -54,6 +54,7 @@ import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPT
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPTIME_99;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_ASSIGNED;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_RUNNING;
+import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_STARTING;
 import static org.apache.aurora.scheduler.sla.SlaGroup.GroupType.CLUSTER;
 import static org.apache.aurora.scheduler.sla.SlaGroup.GroupType.JOB;
 import static org.apache.aurora.scheduler.sla.SlaGroup.GroupType.RESOURCE_CPU;
@@ -88,6 +89,7 @@ class MetricCalculator implements Runnable {
         .build()),
     MEDIANS(ImmutableMultimap.<AlgorithmType, GroupType>builder()
         .putAll(MEDIAN_TIME_TO_ASSIGNED, JOB, CLUSTER, RESOURCE_CPU, RESOURCE_RAM, RESOURCE_DISK)
+        .putAll(MEDIAN_TIME_TO_STARTING, JOB, CLUSTER, RESOURCE_CPU, RESOURCE_RAM, RESOURCE_DISK)
         .putAll(MEDIAN_TIME_TO_RUNNING, JOB, CLUSTER, RESOURCE_CPU, RESOURCE_RAM, RESOURCE_DISK)
         .build());
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
----------------------------------------------------------------------
diff --git a/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java b/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
index 4f243aa..263647e 100644
--- a/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
+++ b/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
@@ -43,6 +43,7 @@ import static java.util.Objects.requireNonNull;
 import static org.apache.aurora.gen.ScheduleStatus.ASSIGNED;
 import static org.apache.aurora.gen.ScheduleStatus.PENDING;
 import static org.apache.aurora.gen.ScheduleStatus.RUNNING;
+import static org.apache.aurora.gen.ScheduleStatus.STARTING;
 
 /**
  * Defines an SLA algorithm to be applied to a {@link IScheduledTask}
@@ -72,6 +73,7 @@ interface SlaAlgorithm {
     JOB_UPTIME_50(new JobUptime(50f), String.format(JobUptime.NAME_FORMAT, 50f)),
     AGGREGATE_PLATFORM_UPTIME(new AggregatePlatformUptime(), "platform_uptime_percent"),
     MEDIAN_TIME_TO_ASSIGNED(new MedianAlgorithm(ASSIGNED), "mtta_ms"),
+    MEDIAN_TIME_TO_STARTING(new MedianAlgorithm(STARTING), "mtts_ms"),
     MEDIAN_TIME_TO_RUNNING(new MedianAlgorithm(RUNNING), "mttr_ms");
 
     private final SlaAlgorithm algorithm;

http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
----------------------------------------------------------------------
diff --git a/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java b/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
index 90ea3a1..eca1bee 100644
--- a/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
+++ b/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
@@ -43,6 +43,7 @@ import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPT
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPTIME_99;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_ASSIGNED;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_RUNNING;
+import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_STARTING;
 import static org.junit.Assert.assertEquals;
 
 public class SlaAlgorithmTest {
@@ -98,6 +99,62 @@ public class SlaAlgorithmTest {
   }
 
   @Test
+  public void testMedianTimeToStartingEven() {
+    Number actual = MEDIAN_TIME_TO_STARTING.getAlgorithm().calculate(
+        ImmutableSet.of(
+            makeTask(ImmutableMap.of(50L, PENDING)), // Ignored as not RUNNING
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, ASSIGNED, 150L, STARTING)),
+            makeTask(ImmutableMap.of(100L, PENDING, 200L, ASSIGNED, 300L, STARTING, 400L, RUNNING)),
+            makeTask(ImmutableMap.of(
+                100L, PENDING,
+                200L, ASSIGNED,
+                300L, STARTING,
+                400L, KILLED)), // Ignored due to being terminal.
+            makeTask(ImmutableMap.of(
+                50L, PENDING,
+                100L, ASSIGNED,
+                150L, STARTING,
+                200L, RUNNING,
+                300L, KILLED))), // Ignored due to being terminal.
+        Range.closedOpen(0L, 500L));
+    assertEquals(100L, actual);
+  }
+
+  @Test
+  public void testMedianTimeToStartingOdd() {
+    Number actual = MEDIAN_TIME_TO_STARTING.getAlgorithm().calculate(
+        ImmutableSet.of(
+            makeTask(ImmutableMap.of(50L, PENDING)), // Ignored as not RUNNING
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, ASSIGNED, 150L, STARTING)),
+            makeTask(ImmutableMap.of(100L, PENDING, 200L, ASSIGNED, 300L, STARTING, 400L, RUNNING)),
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, ASSIGNED, 350L, STARTING)),
+            makeTask(ImmutableMap.of(
+                100L, PENDING,
+                200L, ASSIGNED,
+                300L, STARTING,
+                400L, KILLED)), // Ignored due to being terminal.
+            makeTask(ImmutableMap.of(
+                50L, PENDING,
+                100L, ASSIGNED,
+                150L, STARTING,
+                200L, RUNNING,
+                300L, KILLED))), // Ignored due to being terminal.
+        Range.closedOpen(0L, 500L));
+    assertEquals(200L, actual);
+  }
+
+  @Test
+  public void testMedianTimeToStartingZero() {
+    Number actual = MEDIAN_TIME_TO_STARTING.getAlgorithm().calculate(
+        ImmutableSet.of(
+            makeTask(ImmutableMap.of(50L, PENDING)),
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, STARTING, 200L, RUNNING, 300L, KILLED)),
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, STARTING, 200L, KILLED))),
+        Range.closedOpen(0L, 500L));
+    assertEquals(0L, actual);
+  }
+
+  @Test
   public void testMedianTimeToRunningEven() {
     Number actual = MEDIAN_TIME_TO_RUNNING.getAlgorithm().calculate(
         ImmutableSet.of(
