merlimat opened a new pull request #1245: Collect Prometheus latency stats using DataSketches
URL: https://github.com/apache/bookkeeper/pull/1245

The implementation for collecting and estimating latency quantiles in the Prometheus Java client library is very slow and is impacting bookie performance.

I have added a micro-benchmark that exercises the various stats providers. The tests simulate 16 concurrent threads updating the stats.

#### Counter increment

```
Benchmark                              (statsProvider)   Mode  Cnt    Score      Error  Units
StatsLoggerBenchmark.counterIncrement       Prometheus  thrpt    3  391.882 ?  786.987  ops/us
StatsLoggerBenchmark.counterIncrement         Codahale  thrpt    3  449.341 ? 1337.736  ops/us
StatsLoggerBenchmark.counterIncrement          Twitter  thrpt    3   43.354 ?    9.331  ops/us
StatsLoggerBenchmark.counterIncrement          Ostrich  thrpt    3   43.790 ?    1.332  ops/us
```

Here Prometheus is fast, though not as fast as a simple `LongAdder`, which can reach ~500M ops/sec.

#### Latency quantiles

```
Benchmark                           (statsProvider)   Mode  Cnt  Score   Error  Units
StatsLoggerBenchmark.recordLatency       Prometheus  thrpt    3  0.255 ? 0.667  ops/us
StatsLoggerBenchmark.recordLatency         Codahale  thrpt    3  4.963 ? 1.671  ops/us
StatsLoggerBenchmark.recordLatency          Twitter  thrpt    3  4.793 ? 0.766  ops/us
StatsLoggerBenchmark.recordLatency          Ostrich  thrpt    3  2.473 ? 6.394  ops/us
```

This is where Prometheus is extremely slow: 250K ops/second at most, mostly due to contention and GC pressure.

## Modification

I have re-adapted a stats collector I had written in the Yahoo branch: https://github.com/yahoo/bookkeeper/tree/yahoo-4.3/bookkeeper-stats-providers/datasketches-metrics-provider/src/main/java/org/apache/bokkeeper/stats/datasketches

It is based on the [DataSketches](https://datasketches.github.io/) library, which provides very fast and lightweight quantile estimates (along with a number of other operations), plus some tricks to avoid contention by using thread-local collectors and aggregating them in the background when needed.
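The thread-local trick can be sketched as follows. This is a minimal illustration with hypothetical class names, not the PR's code: a flat sample list stands in for a per-thread DataSketches quantile sketch, and a per-thread lock stands in for the real implementation's sketch swapping. The per-thread lock is only ever contended by the aggregator, so the hot path stays effectively uncontended.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ThreadLocalLatencyStats {

    /** Per-thread recorder; stands in for a per-thread quantile sketch. */
    private static final class Recorder {
        private final List<Double> samples = new ArrayList<>();

        synchronized void record(double value) {
            samples.add(value);
        }

        synchronized List<Double> snapshot() {
            return new ArrayList<>(samples);
        }
    }

    // Registry of every thread's recorder, for the aggregator to walk.
    private final Map<Thread, Recorder> recorders = new ConcurrentHashMap<>();

    private final ThreadLocal<Recorder> local = ThreadLocal.withInitial(() -> {
        Recorder r = new Recorder();
        recorders.put(Thread.currentThread(), r);
        return r;
    });

    /** Hot path: touches only the calling thread's recorder, so no cross-thread contention. */
    public void recordLatency(double millis) {
        local.get().record(millis);
    }

    /** Cold path (e.g. at Prometheus scrape time): merge all per-thread recorders. */
    public double quantile(double q) {
        List<Double> merged = new ArrayList<>();
        for (Recorder r : recorders.values()) {
            merged.addAll(r.snapshot());
        }
        if (merged.isEmpty()) {
            return Double.NaN;
        }
        Collections.sort(merged);
        int idx = (int) Math.round(q * (merged.size() - 1));
        return merged.get(idx);
    }
}
```

Keeping the exact samples and sorting at read time is O(n) memory, which is exactly what a quantile sketch avoids: DataSketches would replace `Recorder` with a fixed-size structure that yields approximate quantiles with bounded error.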
After the change, the throughput is ~150x that of the original Prometheus collector.

```
Benchmark                              (statsProvider)   Mode  Cnt    Score     Error  Units
StatsLoggerBenchmark.counterIncrement       Prometheus  thrpt    3  531.906 ? 129.602  ops/us
StatsLoggerBenchmark.recordLatency          Prometheus  thrpt    3   27.538 ?   5.893  ops/us
```

It is worth noting that the main bottleneck in the `recordLatency` test is now the `System.nanoTime()` call used to pass different samples to the stats logger. `System.nanoTime()` is not particularly fast:

```
Benchmark                               (statsProvider)   Mode  Cnt    Score     Error  Units
StatsLoggerBenchmark.currentTimeMillis              N/A  thrpt    3  161.502 ? 267.238  ops/us
StatsLoggerBenchmark.nanoTime                       N/A  thrpt    3   32.822 ?   2.256  ops/us
```

After removing the `System.nanoTime()` call from the benchmark, the Prometheus+DataSketches collector reaches:

```
Benchmark                           (statsProvider)   Mode  Cnt    Score    Error  Units
StatsLoggerBenchmark.recordLatency       Prometheus  thrpt    3  108.542 ? 31.848  ops/us
```
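For reference, the per-call cost of `System.nanoTime()` can be eyeballed without JMH. The numbers above come from a proper JMH benchmark; this crude stdlib loop is only a rough illustration and slightly overestimates, since the loop is timing its own accumulation as well:

```java
public class NanoTimeCost {
    public static void main(String[] args) {
        final int iterations = 10_000_000;
        long sink = 0;  // accumulate results so the JIT cannot eliminate the calls
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += System.nanoTime();
        }
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("~%.1f ns per System.nanoTime() call (sink=%d)%n",
                (double) elapsedNanos / iterations, sink);
    }
}
```

At ~30 ns per call (consistent with the ~32.8 ops/us figure above), the timing call alone caps a tight latency-recording loop at roughly 30M samples/sec per thread, which is why it now dominates the `recordLatency` result.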