[ https://issues.apache.org/jira/browse/HBASE-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566669#comment-15566669 ]
Gary Helmling commented on HBASE-16146:
---------------------------------------

Here are some microbenchmark results for posterity.

Benchmarking 32 threads updating a single Counter instance:

Counter with patch (removing ThreadLocal):
{noformat}
Result "testCounter":
  N = 121837900
  mean =    456.286 ±(99.9%) 3.249 ns/op

  Percentiles, ns/op:
      p(0.0000) =           45.000 ns/op
     p(50.0000) =          232.000 ns/op
     p(90.0000) =         1138.000 ns/op
     p(95.0000) =         1600.000 ns/op
     p(99.0000) =         2648.000 ns/op
     p(99.9000) =         4456.000 ns/op
     p(99.9900) =        11824.000 ns/op
     p(99.9990) =        45903.487 ns/op
     p(99.9999) =      1528139.979 ns/op
    p(100.0000) =     31424512.000 ns/op
{noformat}

Counter with ThreadLocal:
{noformat}
Result "testCounterThreadLocal":
  N = 104204449
  mean =    412.524 ±(99.9%) 5.910 ns/op

  Percentiles, ns/op:
      p(0.0000) =           45.000 ns/op
     p(50.0000) =          194.000 ns/op
     p(90.0000) =          976.000 ns/op
     p(95.0000) =         1404.000 ns/op
     p(99.0000) =         2532.000 ns/op
     p(99.9000) =         4448.000 ns/op
     p(99.9900) =        11792.000 ns/op
     p(99.9990) =        41655.456 ns/op
     p(99.9999) =      4312849.000 ns/op
    p(100.0000) =    105906176.000 ns/op
{noformat}

Comparison of implementations:
{noformat}
Benchmark                                     Mode        Cnt     Score    Error  Units
IncrementBenchmark.testAtomicLong           sample   81080122  1880.701 ± 14.435  ns/op
IncrementBenchmark.testCounter              sample  121837900   456.286 ±  3.249  ns/op
IncrementBenchmark.testCounterThreadLocal   sample  104204449   412.524 ±  5.910  ns/op
IncrementBenchmark.testLongAdder            sample  108712812    77.910 ±  1.070  ns/op
{noformat}

So, when operating on a single instance, the ThreadLocal version is a bit faster (mean of 412.5 vs. 456.3 ns/op, roughly 10%).
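For readers who want to poke at the single-instance case without a JMH setup, here is a minimal sketch using only the JDK. It is not the IncrementBenchmark that produced the numbers above, and it does not include HBase's Counter; the class and method names are hypothetical. It only shows the shape of the test: N threads released together, all incrementing one shared counter. Timings from a loop like this are indicative at best, since there is no warmup or sampling.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch, not the benchmark from this comment.
final class CounterContentionSketch {

    /** Starts `threads` workers that each call `increment` `perThread` times; returns elapsed nanos. */
    static long runWorkers(int threads, int perThread, Runnable increment) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // hold every worker until all are created
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int n = 0; n < perThread; n++) {
                    increment.run();
                }
            });
            workers[i].start();
        }
        long begin = System.nanoTime();
        start.countDown(); // release all workers at once
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - begin;
    }

    /** All threads contend on a single AtomicLong; returns the final count. */
    static long countWithAtomicLong(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        runWorkers(threads, perThread, counter::incrementAndGet);
        return counter.get();
    }

    /** Same workload on a LongAdder, which stripes updates across internal cells. */
    static long countWithLongAdder(int threads, int perThread) {
        LongAdder counter = new LongAdder();
        runWorkers(threads, perThread, counter::increment);
        return counter.sum();
    }

    public static void main(String[] args) {
        int threads = 32;          // matches the thread count quoted above
        int perThread = 1_000_000; // arbitrary; an assumption of this sketch

        AtomicLong atomic = new AtomicLong();
        long atomicNanos = runWorkers(threads, perThread, atomic::incrementAndGet);

        LongAdder adder = new LongAdder();
        long adderNanos = runWorkers(threads, perThread, adder::increment);

        System.out.printf("AtomicLong: %d increments in %d ms%n", atomic.get(), atomicNanos / 1_000_000);
        System.out.printf("LongAdder:  %d increments in %d ms%n", adder.sum(), adderNanos / 1_000_000);
    }
}
```

Both variants must end with exactly threads * perThread increments; only the time taken should differ.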
However, when microbenchmarking FastLongHistogram with the two different implementations, in a semi-realistic scenario that retains 500 histograms in memory and randomly selects 10 to update on each call, with 200 threads, the cost of the ThreadLocal becomes clearer:

FastLongHistogram with Counter with patch:
{noformat}
Result "fastLong":
  N = 1373429925
  mean =  48721.146 ±(99.9%) 196.908 ns/op

  Percentiles, ns/op:
      p(0.0000) =         2336.000 ns/op
     p(50.0000) =         6664.000 ns/op
     p(90.0000) =         7520.000 ns/op
     p(95.0000) =         7784.000 ns/op
     p(99.0000) =         8560.000 ns/op
     p(99.9000) =        24288.000 ns/op
     p(99.9900) =     94896128.000 ns/op
     p(99.9990) =    153878528.000 ns/op
     p(99.9999) =    654311424.000 ns/op
    p(100.0000) =   2092957696.000 ns/op
{noformat}

FastLongHistogram with Counter with ThreadLocal:
{noformat}
Result "fastLongThreadLocal":
  N = 1251201915
  mean =  84227.741 ±(99.9%) 1114.037 ns/op

  Percentiles, ns/op:
      p(0.0000) =         4056.000 ns/op
     p(50.0000) =         9760.000 ns/op
     p(90.0000) =        12336.000 ns/op
     p(95.0000) =        13648.000 ns/op
     p(99.0000) =        16544.000 ns/op
     p(99.9000) =       285696.000 ns/op
     p(99.9900) =    111017984.000 ns/op
     p(99.9990) =    172228608.000 ns/op
     p(99.9999) =   4445962240.000 ns/op
    p(100.0000) =  31742492672.000 ns/op
{noformat}

Result summary:
{noformat}
Benchmark                                           Mode         Cnt      Score      Error  Units
MultiHistogramBenchmark.fastLong                  sample  1373429925  48721.146 ±  196.908  ns/op
MultiHistogramBenchmark.fastLongThreadLocal       sample  1251201915  84227.741 ± 1114.037  ns/op
MultiHistogramBenchmark.testHDRAtomic             sample  1320949956  27066.038 ±  177.677  ns/op
MultiHistogramBenchmark.testHDRConcurrent         sample  1330869473  26586.309 ±  170.456  ns/op
MultiHistogramBenchmark.testMutableTimeHistogram  sample  1322279057  53766.021 ±  238.439  ns/op
{noformat}

So with more Counters in memory and more threads, removing the ThreadLocal usage results in a ~40% improvement in mean latency (48,721 vs. 84,228 ns/op), with up to an order of magnitude improvement at the upper percentiles (e.g. p99.9: 24,288 vs. 285,696 ns/op). We may still want to investigate using HDRHistogram, since its implementations outperform both versions.
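The ratios quoted in the summary can be re-derived from the reported figures. The sketch below is plain arithmetic on numbers copied from the results above; the class and method names are hypothetical and nothing here re-runs a benchmark.

```java
// Hypothetical helper; inputs are copied from the MultiHistogramBenchmark results above.
final class HistogramResultArithmetic {

    /** Fractional reduction in latency when going from `before` to `after`. */
    static double reduction(double before, double after) {
        return (before - after) / before;
    }

    /** How many times larger `slower` is than `faster`. */
    static double ratio(double slower, double faster) {
        return slower / faster;
    }

    public static void main(String[] args) {
        double patchedMean = 48721.146;     // fastLong, mean ns/op
        double threadLocalMean = 84227.741; // fastLongThreadLocal, mean ns/op
        double patchedP999 = 24288.0;       // fastLong, p(99.9) ns/op
        double threadLocalP999 = 285696.0;  // fastLongThreadLocal, p(99.9) ns/op
        double hdrConcurrentMean = 26586.309; // testHDRConcurrent, mean ns/op

        // About 42%, i.e. the "~40% improvement" in mean latency.
        System.out.printf("mean reduction from removing ThreadLocal: %.1f%%%n",
                100 * reduction(threadLocalMean, patchedMean));

        // Roughly 12x, i.e. the order-of-magnitude gap at p99.9.
        System.out.printf("p99.9 ratio (ThreadLocal / patched): %.1fx%n",
                ratio(threadLocalP999, patchedP999));

        // About 1.8x: the patched FastLongHistogram still trails HDR's concurrent variant.
        System.out.printf("patched mean vs. testHDRConcurrent mean: %.2fx%n",
                ratio(patchedMean, hdrConcurrentMean));
    }
}
```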
But in the short term this should still be an improvement.

> Counters are expensive...
> -------------------------
>
>                 Key: HBASE-16146
>                 URL: https://issues.apache.org/jira/browse/HBASE-16146
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: stack
>         Attachments: HBASE-16146.branch-1.3.001.patch, counters.patch, less_and_less_counters.png
>
>
> Doing workloadc, perf shows 10%+ of CPU being spent on counter#add. If I disable some of the hot ones -- see patch -- I can get 10% more throughput (390k to 440k). Figure something better.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)