himanshug commented on issue #8071: add aggregators for computing mean/average
URL: https://github.com/apache/incubator-druid/issues/8071#issuecomment-514679521

I ran the following benchmark on an idle `t3.medium` EC2 instance.

```
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class MeanAggregationBenchmark
{
  @Param({"10", "100", "1000", "10000", "100000", "1000000"})
  private int n;

  // Accumulates sum and count, divides once at the end.
  @Benchmark
  public void sumcountalgo(Blackhole blackhole)
  {
    double sum = 0;
    long count = 0;
    for (int i = 1; i <= n; i++) {
      count++;
      sum += i;
    }
    double mean = sum / count;
    blackhole.consume(mean);
  }

  // Updates a running mean, one division per value.
  @Benchmark
  public void divbasedalgo(Blackhole blackhole)
  {
    double mean = 0;
    long count = 0;
    for (int i = 1; i <= n; i++) {
      count++;
      mean = mean + (i - mean) / count;
    }
    blackhole.consume(mean);
  }
}
```

that produced

```
Benchmark                                  (n)  Mode  Cnt        Score       Error  Units
MeanAggregationBenchmark.divbasedalgo       10  avgt    5       30.601 ±     0.123  ns/op
MeanAggregationBenchmark.divbasedalgo      100  avgt    5      613.996 ±     4.428  ns/op
MeanAggregationBenchmark.divbasedalgo     1000  avgt    5     7130.145 ±    90.105  ns/op
MeanAggregationBenchmark.divbasedalgo    10000  avgt    5    71296.475 ±   627.211  ns/op
MeanAggregationBenchmark.divbasedalgo   100000  avgt    5   712208.125 ±  5066.485  ns/op
MeanAggregationBenchmark.divbasedalgo  1000000  avgt    5  7117593.058 ± 78224.014  ns/op
MeanAggregationBenchmark.sumcountalgo       10  avgt    5        8.600 ±     0.106  ns/op
MeanAggregationBenchmark.sumcountalgo      100  avgt    5       99.377 ±     1.028  ns/op
MeanAggregationBenchmark.sumcountalgo     1000  avgt    5     1282.228 ±    29.673  ns/op
MeanAggregationBenchmark.sumcountalgo    10000  avgt    5    12790.421 ±   127.669  ns/op
MeanAggregationBenchmark.sumcountalgo   100000  avgt    5   128994.367 ±  2068.910  ns/op
MeanAggregationBenchmark.sumcountalgo  1000000  avgt    5  1281204.111 ± 12692.807  ns/op
```

From that it is clear that the "div based" algo is about 5.5 to 6 times slower than the "sum based" one for n >= 100 (about 3.6 times slower at n = 10).
With the div based algo, 1 million aggregations take about 7.1 ms compared to about 1.3 ms for the sum based one; that might make a difference for some users, but not for most. I think, for the mean aggregator introduced here, we can keep both algos in the code, with `div based` as the default and the ability to switch to `sum based` if need be.
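
To make the "both algos, switchable" idea concrete, here is a minimal sketch of what such an accumulator could look like. The names `MeanSketch` and `Algo`, and the choice to return NaN for an empty input, are assumptions for illustration only; none of this is existing Druid code or the API proposed in this issue.

```java
// Hypothetical sketch only: MeanSketch and Algo are illustrative names, not Druid classes.
final class MeanSketch {
    enum Algo { DIV_BASED, SUM_BASED }

    private final Algo algo;
    private double sum;   // running sum, used only by SUM_BASED
    private double mean;  // running mean, used only by DIV_BASED
    private long count;

    MeanSketch(Algo algo) {
        this.algo = algo;
    }

    // Default to the div based algo, as suggested in the comment above.
    MeanSketch() {
        this(Algo.DIV_BASED);
    }

    void aggregate(double value) {
        count++;
        if (algo == Algo.SUM_BASED) {
            sum += value;                    // one addition per value
        } else {
            mean += (value - mean) / count;  // one division per value
        }
    }

    double get() {
        if (count == 0) {
            return Double.NaN;               // assumption: empty input has no mean
        }
        return algo == Algo.SUM_BASED ? sum / count : mean;
    }

    public static void main(String[] args) {
        MeanSketch div = new MeanSketch();
        MeanSketch sum = new MeanSketch(Algo.SUM_BASED);
        for (int i = 1; i <= 1000; i++) {
            div.aggregate(i);
            sum.aggregate(i);
        }
        System.out.println("div based: " + div.get() + ", sum based: " + sum.get());
    }
}
```

Both modes share `count`, so switching algorithms changes only the per-value update and the final read, which is where the cost difference measured in the benchmark comes from.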
