[jira] [Commented] (HBASE-16146) Counters are expensive...

Gary Helmling (JIRA) Tue, 11 Oct 2016 14:27:05 -0700

    [ https://issues.apache.org/jira/browse/HBASE-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566669#comment-15566669 ]


Gary Helmling commented on HBASE-16146:
---------------------------------------

Here are some microbenchmark results for posterity.

Benchmarking 32 threads updating a single Counter instance:

Counter with patch (removing ThreadLocal):
{noformat}
Result "testCounter":
  N = 121837900
  mean =    456.286 ±(99.9%) 3.249 ns/op

  Percentiles, ns/op:
      p(0.0000) =     45.000 ns/op
     p(50.0000) =    232.000 ns/op
     p(90.0000) =   1138.000 ns/op
     p(95.0000) =   1600.000 ns/op
     p(99.0000) =   2648.000 ns/op
     p(99.9000) =   4456.000 ns/op
     p(99.9900) =  11824.000 ns/op
     p(99.9990) =  45903.487 ns/op
     p(99.9999) = 1528139.979 ns/op
    p(100.0000) = 31424512.000 ns/op
{noformat}

Counter with ThreadLocal:
{noformat}
Result "testCounterThreadLocal":
  N = 104204449
  mean =    412.524 ±(99.9%) 5.910 ns/op

  Percentiles, ns/op:
      p(0.0000) =     45.000 ns/op
     p(50.0000) =    194.000 ns/op
     p(90.0000) =    976.000 ns/op
     p(95.0000) =   1404.000 ns/op
     p(99.0000) =   2532.000 ns/op
     p(99.9000) =   4448.000 ns/op
     p(99.9900) =  11792.000 ns/op
     p(99.9990) =  41655.456 ns/op
     p(99.9999) = 4312849.000 ns/op
    p(100.0000) = 105906176.000 ns/op
{noformat}

Comparison of implementations:
{noformat}
Benchmark                                    Mode        Cnt     Score    Error  Units
IncrementBenchmark.testAtomicLong          sample   81080122  1880.701 ± 14.435  ns/op
IncrementBenchmark.testCounter             sample  121837900   456.286 ±  3.249  ns/op
IncrementBenchmark.testCounterThreadLocal  sample  104204449   412.524 ±  5.910  ns/op
IncrementBenchmark.testLongAdder           sample  108712812    77.910 ±  1.070  ns/op
{noformat}

So, when operating on a single instance, the ThreadLocal version is a bit 
faster: a mean of 412.5 ns/op versus 456.3 ns/op with the patch, roughly 10%.
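
As a rough illustration of the single-instance scenario (many threads incrementing one shared counter), here is a minimal sketch using only JDK classes. It is not the IncrementBenchmark code behind the numbers above: the class and method names are hypothetical, there is no warmup, and it reports wall-clock time divided by total operations rather than sampled per-op latency, so absolute values will not match the results in this comment. It only shows the shape of the contended-increment test for AtomicLong and LongAdder.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical harness for illustration; not the IncrementBenchmark from this issue.
class ContendedIncrementSketch {

  // Starts `threads` workers that each call `increment` `perThread` times,
  // releases them together, and returns the elapsed wall-clock nanoseconds.
  static long runContended(int threads, int perThread, Runnable increment) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);
    for (int t = 0; t < threads; t++) {
      pool.execute(() -> {
        try {
          start.await(); // all workers begin at the same moment
          for (int i = 0; i < perThread; i++) {
            increment.run();
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      });
    }
    long begin = System.nanoTime();
    start.countDown();
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } finally {
      pool.shutdown();
    }
    return System.nanoTime() - begin;
  }

  public static void main(String[] args) {
    int threads = 32; // same thread count as the benchmark above
    int perThread = 1_000_000; // arbitrary choice for this sketch
    long totalOps = (long) threads * perThread;

    AtomicLong atomic = new AtomicLong();
    LongAdder adder = new LongAdder();
    long atomicNanos = runContended(threads, perThread, atomic::incrementAndGet);
    long adderNanos = runContended(threads, perThread, adder::increment);

    System.out.printf("AtomicLong: total=%d, %.1f ns/op (wall clock / ops)%n",
        atomic.get(), (double) atomicNanos / totalOps);
    System.out.printf("LongAdder:  total=%d, %.1f ns/op (wall clock / ops)%n",
        adder.sum(), (double) adderNanos / totalOps);
  }
}
```

For numbers comparable to the tables in this comment, a proper benchmark harness with warmup and per-op sampling would be needed; this sketch only demonstrates the contention pattern.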

However, when microbenchmarking FastLongHistogram using the two different 
implementations, in a semi-realistic scenario which retains 500 histograms in 
memory and randomly selects 10 to update on each call, with 200 threads, the 
cost of the ThreadLocal becomes clearer:

FastLongHistogram with Counter with patch:
{noformat}
Result "fastLong":
  N = 1373429925
  mean =  48721.146 ±(99.9%) 196.908 ns/op

  Percentiles, ns/op:
      p(0.0000) =   2336.000 ns/op
     p(50.0000) =   6664.000 ns/op
     p(90.0000) =   7520.000 ns/op
     p(95.0000) =   7784.000 ns/op
     p(99.0000) =   8560.000 ns/op
     p(99.9000) =  24288.000 ns/op
     p(99.9900) = 94896128.000 ns/op
     p(99.9990) = 153878528.000 ns/op
     p(99.9999) = 654311424.000 ns/op
    p(100.0000) = 2092957696.000 ns/op
{noformat}

FastLongHistogram with Counter with ThreadLocal:
{noformat}
Result "fastLongThreadLocal":
  N = 1251201915
  mean =  84227.741 ±(99.9%) 1114.037 ns/op

  Percentiles, ns/op:
      p(0.0000) =   4056.000 ns/op
     p(50.0000) =   9760.000 ns/op
     p(90.0000) =  12336.000 ns/op
     p(95.0000) =  13648.000 ns/op
     p(99.0000) =  16544.000 ns/op
     p(99.9000) = 285696.000 ns/op
     p(99.9900) = 111017984.000 ns/op
     p(99.9990) = 172228608.000 ns/op
     p(99.9999) = 4445962240.000 ns/op
    p(100.0000) = 31742492672.000 ns/op
{noformat}

Result summary:
{noformat}
Benchmark                                           Mode         Cnt      Score      Error  Units
MultiHistogramBenchmark.fastLong                  sample  1373429925  48721.146 ±  196.908  ns/op
MultiHistogramBenchmark.fastLongThreadLocal       sample  1251201915  84227.741 ± 1114.037  ns/op
MultiHistogramBenchmark.testHDRAtomic             sample  1320949956  27066.038 ±  177.677  ns/op
MultiHistogramBenchmark.testHDRConcurrent         sample  1330869473  26586.309 ±  170.456  ns/op
MultiHistogramBenchmark.testMutableTimeHistogram  sample  1322279057  53766.021 ±  238.439  ns/op
{noformat}
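
For readers who want to picture the multi-histogram workload behind these numbers (500 histograms retained, 10 picked at random per call), here is a minimal sketch of one benchmark "call". Everything in it is an assumption made for illustration: `BucketHistogram` is a hypothetical stand-in rather than FastLongHistogram, the bucket layout and value range are arbitrary, and histograms are chosen with replacement, which may differ from what MultiHistogramBenchmark does.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical workload sketch; not the MultiHistogramBenchmark from this issue.
class MultiHistogramWorkloadSketch {

  // Hypothetical stand-in for a histogram: fixed-width buckets backed by LongAdder.
  static final class BucketHistogram {
    private final LongAdder[] buckets;
    private final long bucketWidth;

    BucketHistogram(int numBuckets, long bucketWidth) {
      this.buckets = new LongAdder[numBuckets];
      for (int i = 0; i < numBuckets; i++) {
        buckets[i] = new LongAdder();
      }
      this.bucketWidth = bucketWidth;
    }

    // Records one value, clamping out-of-range values into the first or last bucket.
    void add(long value) {
      long idx = Math.max(0L, Math.min(buckets.length - 1L, value / bucketWidth));
      buckets[(int) idx].increment();
    }

    // Total number of recorded values across all buckets.
    long count() {
      long total = 0;
      for (LongAdder bucket : buckets) {
        total += bucket.sum();
      }
      return total;
    }
  }

  // Creates the set of histograms that stays in memory for the whole run.
  static BucketHistogram[] createHistograms(int n) {
    BucketHistogram[] histograms = new BucketHistogram[n];
    for (int i = 0; i < n; i++) {
      histograms[i] = new BucketHistogram(100, 10); // arbitrary layout for the sketch
    }
    return histograms;
  }

  // One "call": update `perCall` randomly chosen histograms (chosen with replacement).
  static void updateRandom(BucketHistogram[] histograms, int perCall) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    for (int i = 0; i < perCall; i++) {
      histograms[rnd.nextInt(histograms.length)].add(rnd.nextLong(1000));
    }
  }

  public static void main(String[] args) {
    BucketHistogram[] histograms = createHistograms(500);
    for (int call = 0; call < 1_000; call++) {
      updateRandom(histograms, 10);
    }
    long total = 0;
    for (BucketHistogram h : histograms) {
      total += h.count();
    }
    System.out.println("recorded values: " + total);
  }
}
```

In the benchmark described above, a call like `updateRandom(histograms, 10)` would be issued concurrently from 200 threads, so the number of live counters is much larger than in the single-instance test.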

So with more Counters in memory and more threads, removing the ThreadLocal 
usage results in a ~40% improvement in mean latency (48721 vs. 84228 ns/op), 
with up to an order of magnitude improvement at upper percentiles (for 
example, p99.9 drops from 285696 to 24288 ns/op).
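
The arithmetic behind these figures can be checked with a few lines of Java. The input values are copied from the FastLongHistogram results above; the class and method names are hypothetical helpers for this sketch.

```java
// Hypothetical helper for checking the improvement figures quoted in this comment.
class ImprovementArithmetic {

  // Percentage reduction going from `before` to `after`.
  static double reductionPercent(double before, double after) {
    return (before - after) / before * 100.0;
  }

  // How many times larger `slower` is than `faster`.
  static double ratio(double slower, double faster) {
    return slower / faster;
  }

  public static void main(String[] args) {
    // Mean ns/op: fastLongThreadLocal vs. fastLong (with patch).
    System.out.printf("mean reduction: %.1f%%%n",
        reductionPercent(84227.741, 48721.146));
    // p99.9 ns/op: ThreadLocal vs. patched.
    System.out.printf("p99.9 ratio: %.1fx%n", ratio(285696.0, 24288.0));
    // Max (p100) ns/op: ThreadLocal vs. patched.
    System.out.printf("p100 ratio: %.1fx%n", ratio(31742492672.0, 2092957696.0));
  }
}
```

The mean works out to a reduction of a little over 40%, and both the p99.9 and maximum latencies differ by more than a factor of ten, consistent with the summary above.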

We may still want to investigate using HdrHistogram, since both of its 
implementations tested here (about 27000 ns/op mean) outperform both versions 
of FastLongHistogram.  But in the short term this should still be an 
improvement.

> Counters are expensive...
> -------------------------
>
>                 Key: HBASE-16146
>                 URL: https://issues.apache.org/jira/browse/HBASE-16146
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: stack
>         Attachments: HBASE-16146.branch-1.3.001.patch, counters.patch, 
> less_and_less_counters.png
>
>
> Doing workloadc, perf shows 10%+ of CPU being spent on counter#add. If I 
> disable some of the hot ones -- see patch -- I can get 10% more throughput 
> (390k to 440k). Figure something better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
