Re: Review Request 54439: Add support for an mttu metric (median time to updated)

Joshua Cohen Fri, 09 Dec 2016 11:30:37 -0800


> On Dec. 7, 2016, 6:28 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java, lines 
> > 200-202
> > <https://reviews.apache.org/r/54439/diff/3/?file=1578834#file1578834line200>
> >
> >     Why do we only sample active updates, seems like we could miss data 
> > points? Especially for small updates.
> 
> Joshua Cohen wrote:
>     My thinking was that the vast majority of updates in the store will be 
> completed hours or days ago, so there's no need to consider them when 
> calculating the mttu. You're right, this does mean that we might lose some 
> data points for tasks that moved to `ASSIGNED` in the same 
> `SLA_REFRESH_INTERVAL` (defaults to one minute) in which the entire update 
> completed.
>     
>     For reference, some general stats from one of our clusters: currently at 
> off-peak hours, .02% of all updates in the update store are active. It's hard 
> to say with certainty, historically how many updates were active at any given 
> time. But anecdotatlly it's a small fraction of the total number of updates 
> in the store, generously speaking I'd say 1-2%. That being the case, by 
> including only active updates in the calculation, we reduce the work to be 
> done by anywhere from 98 to 99.98 percent.
>     
>     I feel like this is a fair trade off to make, but I'm not steadfast in 
> that opinion.
> 
> Santhosh Kumar Shanmugham wrote:
>     We can add a storage method that will give all the `InstanceUpdateEvent`s 
> during the last `SLA_REFRESH_INTERVAL` and use that to determine the 
> `activeUpdates` that will be looked into, this can give a much more accurate 
> value.
> 
> Joshua Cohen wrote:
>     I think that would filter out updates that are currently active but have 
> not have an instance event in the past `SLA_REFRESH_INTERVAL`. A trivial 
> example would be an update that processes batches of one instance where each 
> instance takes more than a minute to update.
> 
> Santhosh Kumar Shanmugham wrote:
>     I am talking about this part of the code.
>     
>                  .filter(taskEvent -> taskEvent.getStatus() == ASSIGNED
>                               && timeFrame.contains(taskEvent.getTimestamp()))
>                               
>     I think I misspoke about the event type, it is a `TaskEvent`.


I'm not sure I follow? The discussion above was related to whether we should be 
filtering completed updates, or whether we should iterate all updates in the 
store. Querying for only task events active in the last `SLA_REFRESH_INTERVAL` 
wouldn't be helpful for a few reasons:

1) We already have the full list of tasks in `MetricCalculator` for use in 
other SLA calculations.
2) By the time we iterate task events to find the `ASSIGNED` event, we've 
already iterated the update details.
2) The number of task events at this point will be small... maybe 5 or 6 tops.

Is there something I'm missing?


- Joshua


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54439/#review158363
-----------------------------------------------------------


On Dec. 8, 2016, 9:40 p.m., Joshua Cohen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54439/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2016, 9:40 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, and 
> Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The metric is calculated from the time of the `INSTANCE_UPDATING` event to 
> the subsequent `ASSIGNED` event for the task with the same instance id that 
> matches the desired task config from the update details.
> 
> My original approach to this involved converting `GroupType` and 
> `AlgorithmType` from enums (which cannot be generic) to static classes 
> (which, of course, can). This allowed me to avoid unnecessarily passing 
> update details to the `calculate` method of `SlaAlgorithm` since it's ignored 
> in all but the one, new case. However, that ended up being a lot of churn, 
> and since it turns out we need both the task details and the update details 
> to calculate this metric, I went with the below approach. If anyone feels 
> strongly, I could go back to generics and create an container class that's 
> gives access to both the tasks and update details.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java 
> 9a56cda809fbbcb07e6dd12c7a0feb272542491d 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java 
> 5d8d5bd8f705770979f284d26d2e932aabe707e5 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaGroup.java 
> 6fbd4e962b3bb6eeb0831c810a321478fd52172c 
>   src/test/java/org/apache/aurora/scheduler/sla/MetricCalculatorTest.java 
> 953b65f28a585375e36e305dea6f9f94f99abc93 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java 
> 2e719ac6b7aea86faa22deff2cc6b5f73135761c 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaModuleTest.java 
> 341e346e794c9cf9a2789b8799f38fff900ec9b3 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaTestUtil.java 
> 78f440f7546de9ed6842cb51db02b3bddc9a74ff 
>   
> src/test/java/org/apache/aurora/scheduler/storage/testing/StorageTestUtil.java
>  21d26b3930ea965487b2dec48a48a98677ba022b 
>   src/test/java/org/apache/aurora/scheduler/thrift/Fixtures.java 
> 43e32eede27bbf26363a3fd1ca34ffe6f8c01a73 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/ReadOnlySchedulerImplTest.java
>  6d0e9bc6a8040393875d4f0a88e8db9d6926a88b 
> 
> Diff: https://reviews.apache.org/r/54439/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests.
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>

Re: Review Request 54439: Add support for an mttu metric (median time to updated)

Reply via email to