> On April 16, 2014, 11:57 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java, line 73
> > <https://reviews.apache.org/r/20398/diff/1/?file=560703#file560703line73>
> >
> >     I would find median really useful too.

By median you mean JOB_UPTIME_50?


> On April 16, 2014, 11:57 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java, line 
> > 107
> > <https://reviews.apache.org/r/20398/diff/1/?file=560703#file560703line107>
> >
> >     You probably want to initialize stats with a non-default value, to 
> > avoid transient zeros.

Not sure I understand what default value you mean here. The gauge is created 
right before its value is set (line 155). Sure, there is a brief moment its 
value is going to be zero but that would be a few nanos in the worst case. What 
am I missing?


> On April 16, 2014, 11:57 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java, line 
> > 149
> > <https://reviews.apache.org/r/20398/diff/1/?file=560703#file560703line149>
> >
> >     Have you considered inverting this, to pass the algorithm 
> > implementations a reference to the metric cache (possibly in a scoped way 
> > so they are guaranteed to not collide)?  This would help you avoid the 
> > return value gymnastics to support things like multi-valued outputs.

That was my first idea but I decided to keep calculation separate for a few 
reasons:
- Decoupling of algorithms, groups and stats creates a clear recipe for 
constructing any desired combinations of algorithms/groups without the need for 
special casing (e.g. multi-valued outputs);
- Clear separation of concerns without mixing algorithmic and stat exporting 
parts yields an easier to read/test/extend design.

Sure, there is some performance price to pay for this modularity but I am not 
convinced it's substantial enough to compromise on design here.  


> On April 16, 2014, 11:57 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 70
> > <https://reviews.apache.org/r/20398/diff/1/?file=560704#file560704line70>
> >
> >     If you take the advice above to invert the calls (pass the SlaAlgorithm 
> > a thing through which it can update metrics), you can simplify the uptime 
> > metric computation and require fewer passes over all tasks.  You end up 
> > with a single JobUptime instance that filters, sorts, and picks items at 
> > the desired percentiles.

Well, that's exactly what I wanted to avoid here - making some algorithms 
smarter than others and tackling more than one metric at a time. Collapsing 
multiple percentiles into one algorithm would actually make code slightly more 
complex as it would have to handle multiple metrics in one invocation. Besides, 
it would make it harder to expand specific percentiles to other groups (say 75 
percentile grouped by failureCount or the presence of ancestorId). Given this 
and design points I mentioned above, is there any other compelling reason 
besides performance to go this route?


> On April 16, 2014, 11:57 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 94
> > <https://reviews.apache.org/r/20398/diff/1/?file=560704#file560704line94>
> >
> >     Seems like it would be fairly straightforward to retrofit this for 
> > median as opposed to mean.  How would you feel about doing that?

Need to think if there are any downsides to this but off the top I think it 
would make sense.


> On April 16, 2014, 11:57 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaModule.java, line 61
> > <https://reviews.apache.org/r/20398/diff/1/?file=560707#file560707line61>
> >
> >     The triggering of this fits in nicely with AsyncStatsModule.  Have you 
> > considered kicking off SLA computation there?

Thought about that but decided in favor of keeping things in a separate package 
to avoid making everything public. I am actually considering making it a 
PrivateModule to further limit the type exposure.


> On April 16, 2014, 11:57 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java, line 68
> > <https://reviews.apache.org/r/20398/diff/1/?file=560704#file560704line68>
> >
> >     Is this used other than to just provide a name prefix?  Grouping them 
> > with an enum doesn't seem to be offering much over something like a List.

This is mostly to have a prescriptive enum-only way to create algorithm 
instances following up on the idea of having a consistent way of composing 
algorithms/groups in any desirable fashion.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20398/#review40606
-----------------------------------------------------------


On April 16, 2014, 1:39 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20398/
> -----------------------------------------------------------
> 
> (Updated April 16, 2014, 1:39 a.m.)
> 
> 
> Review request for Aurora, Kevin Sweeney, Bill Farner, and Brian Wickman.
> 
> 
> Bugs: AURORA-293
>     https://issues.apache.org/jira/browse/AURORA-293
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> High level overview:
> - MetricCalculator runs periodically (every minute), pulls all task history, 
> packages it into SlaInstance list and updates stats;
> - Stat calculation is handled by a pair of: SlaAlgorithm and a set of 
> applicable SlaGroups (logical groupings by job, cluster, resource and etc.);
> - Stat name is generated by combining group and algorithm name parts.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/app/AppModule.java 
> eeafc784e915137cacd5f64df1252ccbaf6c0f6c 
>   src/main/java/org/apache/aurora/scheduler/async/RescheduleCalculator.java 
> bc77bf2e6fbc1ff4159d049b0c28afd6832499ef 
>   src/main/java/org/apache/aurora/scheduler/base/Tasks.java 
> fae2d237d18f945d4dd73ec56cd42981359bea46 
>   src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaGroup.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaInstance.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/sla/SlaModule.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/sla/MetricCalculatorTest.java 
> PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java 
> PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/sla/SlaTestUtil.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/20398/diff/
> 
> 
> Testing
> -------
> 
> gradle build
> gradle run
> 
> Sample from local scheduler:
> 
> sla_cluster_mtta 3271.493
> sla_cluster_mttr 3273.497
> sla_cluster_platform_uptime_percent 100.0
> sla_cpu_small_mtta 3271.493
> sla_cpu_small_mttr 3273.497
> sla_disk_small_mtta 3271.493
> sla_disk_small_mttr 3273.497
> sla_mesos_test_serviceJob0_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob0_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob0_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob0_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob0_mtta 0.0
> sla_mesos_test_serviceJob0_mttr 0.0
> sla_mesos_test_serviceJob0_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob10_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob10_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob10_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob10_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob10_mtta 0.0
> sla_mesos_test_serviceJob10_mttr 0.0
> sla_mesos_test_serviceJob10_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob12_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob12_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob12_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob12_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob12_mtta 0.0
> sla_mesos_test_serviceJob12_mttr 0.0
> sla_mesos_test_serviceJob12_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob14_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob14_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob14_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob14_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob14_mtta 0.0
> sla_mesos_test_serviceJob14_mttr 0.0
> sla_mesos_test_serviceJob14_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob16_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob16_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob16_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob16_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob16_mtta 0.0
> sla_mesos_test_serviceJob16_mttr 0.0
> sla_mesos_test_serviceJob16_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob18_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob18_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob18_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob18_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob18_mtta 0.0
> sla_mesos_test_serviceJob18_mttr 0.0
> sla_mesos_test_serviceJob18_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob2_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob2_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob2_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob2_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob2_mtta 0.0
> sla_mesos_test_serviceJob2_mttr 0.0
> sla_mesos_test_serviceJob2_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob4_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob4_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob4_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob4_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob4_mtta 0.0
> sla_mesos_test_serviceJob4_mttr 0.0
> sla_mesos_test_serviceJob4_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob6_job_uptime_75_00_sec 7125
> sla_mesos_test_serviceJob6_job_uptime_90_00_sec 7125
> sla_mesos_test_serviceJob6_job_uptime_95_00_sec 7125
> sla_mesos_test_serviceJob6_job_uptime_99_00_sec 7125
> sla_mesos_test_serviceJob6_mtta 3271.493
> sla_mesos_test_serviceJob6_mttr 3273.497
> sla_mesos_test_serviceJob6_platform_uptime_percent 100.0
> sla_mesos_test_serviceJob8_job_uptime_75_00_sec 0
> sla_mesos_test_serviceJob8_job_uptime_90_00_sec 0
> sla_mesos_test_serviceJob8_job_uptime_95_00_sec 0
> sla_mesos_test_serviceJob8_job_uptime_99_00_sec 0
> sla_mesos_test_serviceJob8_mtta 0.0
> sla_mesos_test_serviceJob8_mttr 0.0
> sla_mesos_test_serviceJob8_platform_uptime_percent 100.0
> sla_ram_small_mtta 3271.493
> sla_ram_small_mttr 3273.497
> 
> 
> File Attachments
> ----------------
> 
> Coverage report
>   
> https://reviews.apache.org/media/uploaded/files/2014/04/16/ffe00c63-bb3a-4b90-95f8-f23878b0fdab__SLA_coverage_report.png
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Reply via email to