[ https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012511#comment-17012511 ]

Jordan West commented on CASSANDRA-15213:
-----------------------------------------

I've updated [https://github.com/jrwest/cassandra/tree/jwest/15213] with the 
changes you suggested. Regarding the extra bucket, I realized we can do 
something even simpler: return {{bucketOffsets.length}} whenever the estimate 
is greater than or equal to it, which saves us the array access in that case 
altogether.
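
To make that concrete, here's a minimal sketch of the lookup (illustrative 
only, not the exact code on the branch; it assumes the convention that bucket 
{{i}} covers values up to and including {{bucketOffsets[i]}}, with anything 
past the last offset landing in the overflow bucket):

{code:java}
// Sketch only: seed a short linear search with an estimate of
// floor(log2(value) / log2(1.2)), and return bucketOffsets.length
// without touching the array when the estimate is already past the end.
private static final double LOG2_1_2 = Math.log(1.2) / Math.log(2.0);

static int bucketIndex(long[] bucketOffsets, long value)
{
    if (value <= 0)
        return 0;
    int log2 = 63 - Long.numberOfLeadingZeros(value); // floor(log2(value))
    int estimate = (int) (log2 / LOG2_1_2);
    if (estimate >= bucketOffsets.length)
        return bucketOffsets.length; // overflow bucket: no array access at all
    // correct the estimate with a short linear search in either direction
    int i = estimate;
    while (i > 0 && value <= bucketOffsets[i - 1])
        i--;
    while (i < bucketOffsets.length && value > bucketOffsets[i])
        i++;
    return i;
}
{code}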

Also added a very simple striping approach. The performance is about the same 
(the runs are within each other's margin of error) and still significantly 
better than before CASSANDRA-14281. From the memory consumption perspective, 
however, 4 stripes end up allocating more than 12MiB after startup. This isn't 
surprising: with a single stripe (or pre-CASSANDRA-14281) the footprint after 
startup was a little more than 3MiB, and we're quadrupling the size of each 
array. However, these arrays won't grow under contention the way 
{{LongAdder[]}} would: memory consumption is on the order of the number of 
histograms created, not, as before, the number created plus the contention on 
them. I've found comparable performance using 2 stripes on my 4 core machine. 
I have an 8 core machine available and can test with it tomorrow. If the 
memory consumption is still a concern I can investigate on-demand striping, 
but I prefer the simpler approach and think the trade-off of 2-4 stripes is 
reasonable. 
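
To show the shape of the striping (illustrative names, not the actual branch): 
a single flat {{AtomicLongArray}} holds one copy of the bucket counts per 
stripe, each thread is pinned to a stripe by its id, and reads fold the 
stripes back together:

{code:java}
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch only: fixed striping over a flat atomic array. Memory is
// nStripes * nBuckets longs per histogram, allocated up front, so it
// never grows under contention the way a LongAdder[] would.
final class StripedBuckets
{
    private final int nStripes;
    private final int nBuckets;
    private final AtomicLongArray counts; // nStripes copies of the buckets

    StripedBuckets(int nStripes, int nBuckets)
    {
        this.nStripes = nStripes;
        this.nBuckets = nBuckets;
        this.counts = new AtomicLongArray(nStripes * nBuckets);
    }

    void increment(int bucket)
    {
        // each thread always hits the same stripe, spreading contention
        int stripe = (int) (Thread.currentThread().getId() % nStripes);
        counts.incrementAndGet(stripe * nBuckets + bucket);
    }

    long count(int bucket)
    {
        // readers sum across stripes; may be slightly stale under writes
        long total = 0;
        for (int s = 0; s < nStripes; s++)
            total += counts.get(s * nBuckets + bucket);
        return total;
    }
}
{code}

This is why 4 stripes quadruple the per-histogram footprint (the ~3MiB to 
~12MiB jump above) while keeping allocation independent of contention.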

 
{code:java}
3.0 (4 core):
     [java] Benchmark                                       Mode  Cnt        Score        Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  5848504.963 ± 147881.511  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5  1946154.537 ± 623185.290  ops/s

Trunk (4 core):
     [java] Benchmark                                       Mode  Cnt         Score         Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  21887453.678 ± 2108481.285  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   8908817.316 ±  115453.642  ops/s

15213 (4 core, linear search changes only):
     [java] Benchmark                                       Mode  Cnt         Score         Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  24646022.304 ± 1052105.818  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   9175928.594 ±  269984.204  ops/s

15213 (4 core, 4 stripes, striping and linear search):
     [java] Benchmark                                       Mode  Cnt         Score        Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  18818181.576 ± 506997.366  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   8895569.814 ± 154219.113  ops/s
{code}
 

> DecayingEstimatedHistogramReservoir Inefficiencies
> --------------------------------------------------
>
>                 Key: CASSANDRA-15213
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15213
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Benedict Elliott Smith
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> * {{LongAdder}} introduced to trunk consumes 9MiB of heap without user 
> schemas, and this will grow significantly under contention and with user 
> schemas that have many tables. This is because {{LongAdder}} is a very 
> heavy class designed for a single contended value.
>  ** This can likely be improved significantly, without significant loss of 
> performance in the contended case, by simply increasing the size of our 
> primitive backing array and providing multiple buckets, with each thread 
> picking a bucket to increment, or simply multiple backing arrays.  Probably a 
> better way still to do this would be to introduce some competition detection 
> to the update, much like {{LongAdder}} utilises, that increases the number of 
> backing arrays under competition.
>  ** To save memory this approach could partition the space into chunks that 
> are likely to be updated together, so that we do not need to duplicate the 
> entire array under competition.
> * Similarly, binary search is costly: a measurable cost as a share of the 
> new networking work (without filtering it was > 10% of the CPU used 
> overall). We can compute the approximation floor(log2 n / log2 1.2) 
> extremely cheaply, to save the random memory access costs.


