[ https://issues.apache.org/jira/browse/HBASE-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566669#comment-15566669 ]
Gary Helmling commented on HBASE-16146:
---------------------------------------
Here are some microbenchmark results for posterity.
Benchmarking 32 threads updating a single Counter instance:
Counter with patch (removing ThreadLocal):
{noformat}
Result "testCounter":
N = 121837900
mean = 456.286 ±(99.9%) 3.249 ns/op
Percentiles, ns/op:
p(0.0000) = 45.000 ns/op
p(50.0000) = 232.000 ns/op
p(90.0000) = 1138.000 ns/op
p(95.0000) = 1600.000 ns/op
p(99.0000) = 2648.000 ns/op
p(99.9000) = 4456.000 ns/op
p(99.9900) = 11824.000 ns/op
p(99.9990) = 45903.487 ns/op
p(99.9999) = 1528139.979 ns/op
p(100.0000) = 31424512.000 ns/op
{noformat}
Counter with ThreadLocal:
{noformat}
Result "testCounterThreadLocal":
N = 104204449
mean = 412.524 ±(99.9%) 5.910 ns/op
Percentiles, ns/op:
p(0.0000) = 45.000 ns/op
p(50.0000) = 194.000 ns/op
p(90.0000) = 976.000 ns/op
p(95.0000) = 1404.000 ns/op
p(99.0000) = 2532.000 ns/op
p(99.9000) = 4448.000 ns/op
p(99.9900) = 11792.000 ns/op
p(99.9990) = 41655.456 ns/op
p(99.9999) = 4312849.000 ns/op
p(100.0000) = 105906176.000 ns/op
{noformat}
Comparison of implementations:
{noformat}
Benchmark                                  Mode          Cnt     Score     Error  Units
IncrementBenchmark.testAtomicLong          sample   81080122  1880.701  ± 14.435  ns/op
IncrementBenchmark.testCounter             sample  121837900   456.286  ±  3.249  ns/op
IncrementBenchmark.testCounterThreadLocal  sample  104204449   412.524  ±  5.910  ns/op
IncrementBenchmark.testLongAdder           sample  108712812    77.910  ±  1.070  ns/op
{noformat}
So, when operating on a single instance, the ThreadLocal version is a bit
faster: roughly 10% on the mean (412.5 vs. 456.3 ns/op).
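For readers who want to reproduce the general shape of the single-instance test without a JMH setup, here is a minimal sketch of many threads incrementing one shared counter. It only uses the two JDK classes that appear in the comparison table (AtomicLong and LongAdder); it is not the benchmark used above, the class and method names are hypothetical, and timing with System.nanoTime() is far cruder than JMH sampling.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not the IncrementBenchmark from this comment.
final class ContendedCounterSketch {

    /** Runs the same task on `threads` threads and waits for all of them to finish. */
    static void runOnThreads(int threads, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(task);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** All threads do a read-modify-write on the same memory location. */
    static long incrementAtomicLong(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        runOnThreads(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                counter.incrementAndGet();
            }
        });
        return counter.get();
    }

    /** LongAdder spreads updates over internal cells and sums them on read. */
    static long incrementLongAdder(int threads, int perThread) {
        LongAdder counter = new LongAdder();
        runOnThreads(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                counter.increment();
            }
        });
        return counter.sum();
    }

    public static void main(String[] args) {
        int threads = 32;           // same thread count as the benchmark above
        int perThread = 1_000_000;  // arbitrary choice for this sketch

        long t0 = System.nanoTime();
        long atomicTotal = incrementAtomicLong(threads, perThread);
        long t1 = System.nanoTime();
        long adderTotal = incrementLongAdder(threads, perThread);
        long t2 = System.nanoTime();

        System.out.printf("AtomicLong: total=%d, %d ms%n", atomicTotal, (t1 - t0) / 1_000_000);
        System.out.printf("LongAdder:  total=%d, %d ms%n", adderTotal, (t2 - t1) / 1_000_000);
    }
}
```

Both variants must report threads * perThread as the total; only the elapsed time should differ, and the absolute numbers will not be comparable to the ns/op figures above.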
However, when microbenchmarking FastLongHistogram using the two different
implementations, in a semi-realistic scenario which retains 500 histograms in
memory, randomly selecting 10 to update on each call, with 200 threads, the
cost of the ThreadLocal becomes clearer:
FastLongHistogram with Counter with patch:
{noformat}
Result "fastLong":
N = 1373429925
mean = 48721.146 ±(99.9%) 196.908 ns/op
Percentiles, ns/op:
p(0.0000) = 2336.000 ns/op
p(50.0000) = 6664.000 ns/op
p(90.0000) = 7520.000 ns/op
p(95.0000) = 7784.000 ns/op
p(99.0000) = 8560.000 ns/op
p(99.9000) = 24288.000 ns/op
p(99.9900) = 94896128.000 ns/op
p(99.9990) = 153878528.000 ns/op
p(99.9999) = 654311424.000 ns/op
p(100.0000) = 2092957696.000 ns/op
{noformat}
FastLongHistogram with Counter with ThreadLocal:
{noformat}
Result "fastLongThreadLocal":
N = 1251201915
mean = 84227.741 ±(99.9%) 1114.037 ns/op
Percentiles, ns/op:
p(0.0000) = 4056.000 ns/op
p(50.0000) = 9760.000 ns/op
p(90.0000) = 12336.000 ns/op
p(95.0000) = 13648.000 ns/op
p(99.0000) = 16544.000 ns/op
p(99.9000) = 285696.000 ns/op
p(99.9900) = 111017984.000 ns/op
p(99.9990) = 172228608.000 ns/op
p(99.9999) = 4445962240.000 ns/op
p(100.0000) = 31742492672.000 ns/op
{noformat}
Result summary:
{noformat}
Benchmark                                         Mode           Cnt      Score       Error  Units
MultiHistogramBenchmark.fastLong                  sample  1373429925  48721.146  ±  196.908  ns/op
MultiHistogramBenchmark.fastLongThreadLocal       sample  1251201915  84227.741  ± 1114.037  ns/op
MultiHistogramBenchmark.testHDRAtomic             sample  1320949956  27066.038  ±  177.677  ns/op
MultiHistogramBenchmark.testHDRConcurrent         sample  1330869473  26586.309  ±  170.456  ns/op
MultiHistogramBenchmark.testMutableTimeHistogram  sample  1322279057  53766.021  ±  238.439  ns/op
{noformat}
So with more Counters in memory and more threads, removing the ThreadLocal
usage cuts the mean by ~40% (48721 vs. 84228 ns/op), with up to an order of
magnitude improvement at upper percentiles (e.g. 24288 vs. 285696 ns/op at
p99.9).
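As a quick check of that summary, the arithmetic can be redone from the numbers reported above. This is only a sketch: the class and method names are made up for illustration, and the inputs are copied from the fastLong and fastLongThreadLocal results.

```java
// Hypothetical helper for re-deriving the summary figures from the results above.
final class ImprovementSketch {

    /** Fraction by which `after` is lower than `before`, e.g. 0.40 means 40% lower. */
    static double relativeReduction(double before, double after) {
        return (before - after) / before;
    }

    /** How many times larger `slower` is than `faster`. */
    static double ratio(double slower, double faster) {
        return slower / faster;
    }

    public static void main(String[] args) {
        // Mean ns/op: fastLongThreadLocal vs. fastLong.
        double meanReduction = relativeReduction(84227.741, 48721.146);
        // p(99.9000) ns/op: fastLongThreadLocal vs. fastLong.
        double p999Ratio = ratio(285696.0, 24288.0);
        // p(100.0000) ns/op: fastLongThreadLocal vs. fastLong.
        double maxRatio = ratio(31742492672.0, 2092957696.0);

        System.out.printf("mean reduction: %.1f%%%n", meanReduction * 100.0);
        System.out.printf("p99.9 ratio:    %.1fx%n", p999Ratio);
        System.out.printf("max ratio:      %.1fx%n", maxRatio);
    }
}
```

The gap is not uniform across the tail: at p99.99 and p99.999 the two versions are within about 20% of each other, so "up to" an order of magnitude is the right reading.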
We may still want to investigate using HdrHistogram, since its implementations
outperform both versions. But in the short term this should still be an
improvement.
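For anyone wanting to experiment with the multi-histogram scenario described above (many histograms held in memory, a handful picked at random on each call, many threads), here is a rough plain-JDK sketch of that access pattern. It is not the MultiHistogramBenchmark used for these numbers: BucketHistogram is a hypothetical stand-in rather than FastLongHistogram, the power-of-two bucketing and the value range are assumptions, and histograms are picked with replacement.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of the workload shape only, not the benchmark from this comment.
final class MultiHistogramWorkloadSketch {

    /** Stand-in histogram (an assumption): one LongAdder per power-of-two bucket. */
    static final class BucketHistogram {
        private final LongAdder[] buckets = new LongAdder[64];

        BucketHistogram() {
            for (int i = 0; i < buckets.length; i++) {
                buckets[i] = new LongAdder();
            }
        }

        void add(long value) {
            // Bucket 0 holds 0 (and negatives); bucket k holds values in [2^(k-1), 2^k).
            int index = 64 - Long.numberOfLeadingZeros(Math.max(value, 0L));
            buckets[index].increment();
        }

        long count() {
            long total = 0;
            for (LongAdder bucket : buckets) {
                total += bucket.sum();
            }
            return total;
        }
    }

    /** Runs the workload and returns the total number of recorded values. */
    static long run(int histogramCount, int updatesPerCall, int threads, int callsPerThread) {
        BucketHistogram[] histograms = new BucketHistogram[histogramCount];
        for (int i = 0; i < histograms.length; i++) {
            histograms[i] = new BucketHistogram();
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (int call = 0; call < callsPerThread; call++) {
                    // One "call": update a few randomly chosen histograms.
                    for (int u = 0; u < updatesPerCall; u++) {
                        histograms[rnd.nextInt(histograms.length)].add(rnd.nextLong(1_000_000L));
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        long total = 0;
        for (BucketHistogram histogram : histograms) {
            total += histogram.count();
        }
        return total;
    }

    public static void main(String[] args) {
        // 500 histograms, 10 updates per call, 200 threads, as in the scenario above;
        // the number of calls per thread is an arbitrary choice for this sketch.
        long recorded = run(500, 10, 200, 10_000);
        System.out.println("recorded values: " + recorded);
    }
}
```

Swapping the LongAdder buckets for another counter implementation is the point of such a harness: with hundreds of histograms alive, any per-counter, per-thread state is multiplied by both the number of counters and the number of threads.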
> Counters are expensive...
> -------------------------
>
> Key: HBASE-16146
> URL: https://issues.apache.org/jira/browse/HBASE-16146
> Project: HBase
> Issue Type: Sub-task
> Reporter: stack
> Attachments: HBASE-16146.branch-1.3.001.patch, counters.patch,
> less_and_less_counters.png
>
>
> Doing workloadc, perf shows 10%+ of CPU being spent on counter#add. If I
> disable some of the hot ones -- see patch -- I can get 10% more throughput
> (390k to 440k). Figure something better.