Re: Review Request 54439: Add support for an mttu metric (median time to updated)

Joshua Cohen Mon, 12 Dec 2016 10:12:09 -0800


> On Dec. 7, 2016, 6:28 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java, lines 
> > 200-202
> > <https://reviews.apache.org/r/54439/diff/3/?file=1578834#file1578834line200>
> >
> >     Why do we only sample active updates, seems like we could miss data 
> > points? Especially for small updates.
> 
> Joshua Cohen wrote:
>     My thinking was that the vast majority of updates in the store will be 
> completed hours or days ago, so there's no need to consider them when 
> calculating the mttu. You're right, this does mean that we might lose some 
> data points for tasks that moved to `ASSIGNED` in the same 
> `SLA_REFRESH_INTERVAL` (defaults to one minute) in which the entire update 
> completed.
>     
>     For reference, some general stats from one of our clusters: currently at 
> off-peak hours, .02% of all updates in the update store are active. It's hard 
> to say with certainty, historically how many updates were active at any given 
> time. But anecdotatlly it's a small fraction of the total number of updates 
> in the store, generously speaking I'd say 1-2%. That being the case, by 
> including only active updates in the calculation, we reduce the work to be 
> done by anywhere from 98 to 99.98 percent.
>     
>     I feel like this is a fair trade off to make, but I'm not steadfast in 
> that opinion.
> 
> Santhosh Kumar Shanmugham wrote:
>     We can add a storage method that will give all the `InstanceUpdateEvent`s 
> during the last `SLA_REFRESH_INTERVAL` and use that to determine the 
> `activeUpdates` that will be looked into, this can give a much more accurate 
> value.
> 
> Joshua Cohen wrote:
>     I think that would filter out updates that are currently active but have 
> not have an instance event in the past `SLA_REFRESH_INTERVAL`. A trivial 
> example would be an update that processes batches of one instance where each 
> instance takes more than a minute to update.
> 
> Santhosh Kumar Shanmugham wrote:
>     I am talking about this part of the code.
>     
>                  .filter(taskEvent -> taskEvent.getStatus() == ASSIGNED
>                               && timeFrame.contains(taskEvent.getTimestamp()))
>                               
>     I think I misspoke about the event type, it is a `TaskEvent`.
> 
> Joshua Cohen wrote:
>     I'm not sure I follow? The discussion above was related to whether we 
> should be filtering completed updates, or whether we should iterate all 
> updates in the store. Querying for only task events active in the last 
> `SLA_REFRESH_INTERVAL` wouldn't be helpful for a few reasons:
>     
>     1) We already have the full list of tasks in `MetricCalculator` for use 
> in other SLA calculations.
>     2) By the time we iterate task events to find the `ASSIGNED` event, we've 
> already iterated the update details.
>     2) The number of task events at this point will be small... maybe 5 or 6 
> tops.
>     
>     Is there something I'm missing?
> 
> Santhosh Kumar Shanmugham wrote:
>     I feel that the current logic has correctness issues, since it can miss 
> arbitrary number of data points and makes the metric unreliable. My 
> suggestion is to lead to the creation a metric that is reliable at the 
> expense of maybe more work.
> 
> David McLaughlin wrote:
>     Oh so I was not proposing querying all updates rather than just active 
> ones. Santhosh's suggestion was more in line with what I was thinking as an 
> alternative to avoid the data loss. We might not have storage support for 
> such a query right now (or maybe we do?), but could be worth adding to 
> support this?

I'm still not clear what you guys are asking for. Are you just saying rather 
than filtering out all non-active updates, we should instead filter out all 
updates that have been not been active with some timeframe?

If so, using just `SLA_REFRESH_INTERVAL` would not work, since that raises the 
issues in my original response to Santhosh (i.e. an update that is active, but 
where instances take a long time to transition to running would be missed). We 
could pick just pick some abritrary value to try and minimize the chances of 
that occurring, e.g. any update that has instance update events within the past 
10 minutes is included... This still has the chance of missing some data, 
though unlikely unless under a very specific set of circumstances.

I think the current impl is also not likely to miss much data though. Since 
we're timing from `UPDATING` to `ASSIGNED`,  that means the task would have to 
be assigned and the entire update would need to complete within the same 
`SLA_REFRESH_INTERVAL` for the data to be lost. Essentially this means either 
overriding the default `SLA_REFRESH_INTERVAL` to a longer value, or configuring 
an update such that all tasks can be killed, assigned and the update itself 
completes (i.e. no `watch_secs` or no health check -> running under the new 
logic) within a single interval.

There's currently no way to query based on recently updated, but we could add 
support for that to `JobUpdateQuery`. We'd need to add an index to 
`job_instance_update_events` as well, otherwise filtering by timestamp there 
would be unreasonably slow. If you feel this is warranted, I can add that 
support and change the query we use in `MetricCalculator`.

- Joshua

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54439/#review158363
-----------------------------------------------------------

On Dec. 8, 2016, 9:40 p.m., Joshua Cohen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54439/
> -----------------------------------------------------------
> 
> (Updated Dec. 8, 2016, 9:40 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, and 
> Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> The metric is calculated from the time of the `INSTANCE_UPDATING` event to 
> the subsequent `ASSIGNED` event for the task with the same instance id that 
> matches the desired task config from the update details.
> 
> My original approach to this involved converting `GroupType` and 
> `AlgorithmType` from enums (which cannot be generic) to static classes 
> (which, of course, can). This allowed me to avoid unnecessarily passing 
> update details to the `calculate` method of `SlaAlgorithm` since it's ignored 
> in all but the one, new case. However, that ended up being a lot of churn, 
> and since it turns out we need both the task details and the update details 
> to calculate this metric, I went with the below approach. If anyone feels 
> strongly, I could go back to generics and create an container class that's 
> gives access to both the tasks and update details.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java 
> 9a56cda809fbbcb07e6dd12c7a0feb272542491d 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java 
> 5d8d5bd8f705770979f284d26d2e932aabe707e5 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaGroup.java 
> 6fbd4e962b3bb6eeb0831c810a321478fd52172c 
>   src/test/java/org/apache/aurora/scheduler/sla/MetricCalculatorTest.java 
> 953b65f28a585375e36e305dea6f9f94f99abc93 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java 
> 2e719ac6b7aea86faa22deff2cc6b5f73135761c 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaModuleTest.java 
> 341e346e794c9cf9a2789b8799f38fff900ec9b3 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaTestUtil.java 
> 78f440f7546de9ed6842cb51db02b3bddc9a74ff 
>   
> src/test/java/org/apache/aurora/scheduler/storage/testing/StorageTestUtil.java
>  21d26b3930ea965487b2dec48a48a98677ba022b 
>   src/test/java/org/apache/aurora/scheduler/thrift/Fixtures.java 
> 43e32eede27bbf26363a3fd1ca34ffe6f8c01a73 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/ReadOnlySchedulerImplTest.java
>  6d0e9bc6a8040393875d4f0a88e8db9d6926a88b 
> 
> Diff: https://reviews.apache.org/r/54439/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew build -Pq
> e2e tests.
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>

Re: Review Request 54439: Add support for an mttu metric (median time to updated)

Reply via email to