[ https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012511#comment-17012511 ]
Jordan West edited comment on CASSANDRA-15213 at 1/10/20 4:14 PM:
------------------------------------------------------------------

I've updated [https://github.com/jrwest/cassandra/tree/jwest/15213] with the changes you suggested. Regarding the extra bucket, I realized that, even more simply, we could return {{bucketOffsets.length}} if the estimate is greater than or equal to it, saving us an array access in that case altogether. I also added a very simple striping approach, without any attempt at distributing the buckets more uniformly. The performance is about the same (the runs are within each other's margin of error) and still significantly better than before CASSANDRA-14281. From the memory consumption perspective, however, 4 stripes end up allocating more than 12MiB after startup. This isn't surprising: with a single stripe (or pre-CASSANDRA-14281) memory use after startup was a little more than 3MiB, and we are quadrupling the size of each array. However, these arrays won't grow under contention the way a {{LongAdder[]}} would: memory consumption scales with the number of histograms created, not with the number created and the level of contention as before. I've found comparable performance using 2 stripes on my 4-core machine. I have an 8-core machine available that I can test with tomorrow. If the memory consumption is still a concern I can investigate on-demand striping, but I prefer the simpler approach and think the trade-offs of 2-4 stripes are reasonable.
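The fixed-stripe idea described above can be sketched roughly as follows. This is an illustrative simplification under assumed names ({{StripedBuckets}} and so on), not the actual patch: each thread maps to one of N copies of the primitive bucket array on write, and reads sum a bucket across all stripes.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of fixed striping: keep nStripes copies of the
// bucket array in one flat AtomicLongArray, and let each thread pick a
// stripe from its thread id. Memory is nStripes x the single-stripe
// footprint, fixed at construction; it does not grow under contention.
class StripedBuckets
{
    private final int nStripes;
    private final int nBuckets;
    private final AtomicLongArray counts; // nStripes * nBuckets longs

    StripedBuckets(int nStripes, int nBuckets)
    {
        this.nStripes = nStripes;
        this.nBuckets = nBuckets;
        this.counts = new AtomicLongArray(nStripes * nBuckets);
    }

    void add(int bucket, long delta)
    {
        // Spread threads across stripes to reduce CAS contention on hot buckets.
        int stripe = (int) (Thread.currentThread().getId() % nStripes);
        counts.addAndGet(stripe * nBuckets + bucket, delta);
    }

    long get(int bucket)
    {
        // Reads are rare relative to writes, so summing across stripes is cheap.
        long sum = 0;
        for (int s = 0; s < nStripes; s++)
            sum += counts.get(s * nBuckets + bucket);
        return sum;
    }
}
```

This makes concrete why 4 stripes roughly quadruples the post-startup footprint while staying bounded, unlike a {{LongAdder[]}}, which allocates additional cells dynamically under contention.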
{code:java}
3.0 (4 core):
    [java] Benchmark                                      Mode  Cnt         Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5   5848504.963 ±  147881.511  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5   1946154.537 ±  623185.290  ops/s

Trunk (4 core):
    [java] Benchmark                                      Mode  Cnt         Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5  21887453.678 ± 2108481.285  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5   8908817.316 ±  115453.642  ops/s

15213 (4 core, linear search changes only):
    [java] Benchmark                                      Mode  Cnt         Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5  24646022.304 ± 1052105.818  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5   9175928.594 ±  269984.204  ops/s

15213 (4 core, 4 stripes, striping and linear search):
    [java] Benchmark                                      Mode  Cnt         Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5  18818181.576 ±  506997.366  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5   8895569.814 ±  154219.113  ops/s
{code}

> DecayingEstimatedHistogramReservoir Inefficiencies
> --------------------------------------------------
>
>                 Key: CASSANDRA-15213
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15213
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Benedict Elliott Smith
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 4.0-beta
>
> * {{LongAdder}} introduced to trunk consumes 9MiB of heap without user schemas, and this will grow
significantly under contention and with user schemas containing many tables. This is because {{LongAdder}} is a very heavy class designed for single contended values.
> ** This can likely be improved significantly, without significant loss of performance in the contended case, by simply increasing the size of our primitive backing array and providing multiple buckets, with each thread picking a bucket to increment, or simply multiple backing arrays. Probably a better way still to do this would be to introduce some competition detection to the update, much like {{LongAdder}} utilises, that increases the number of backing arrays under competition.
> ** To save memory this approach could partition the space into chunks that are likely to be updated together, so that we do not need to duplicate the entire array under competition.
> * Similarly, binary search is costly and a measurable cost as a share of the new networking work (without filtering it was >10% of the CPU used overall). We can compute an approximation floor(log2 n / log2 1.2) extremely cheaply, to save the random memory access costs.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
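For reference, the cheap floor(log2 n / log2 1.2) index approximation mentioned in the description can be sketched like this. Names are illustrative, not the actual patch; the clamp to the bucket count implements the "return {{bucketOffsets.length}} when the estimate reaches it" shortcut discussed in the comment, and in the real change a short linear search from the estimate replaces the binary search.

```java
// Hypothetical sketch: since the histogram buckets grow geometrically with
// ratio 1.2, the bucket index of value n is approximately
// floor(log2(n) / log2(1.2)). floor(log2(n)) comes from numberOfLeadingZeros,
// avoiding the random memory accesses of a binary search over the offsets.
class BucketIndex
{
    static final double LOG2_1_2 = Math.log(1.2) / Math.log(2); // log2(1.2)

    // Cheap floor(log2 n) for n >= 1 via leading zeros.
    static int fastLog2(long n)
    {
        return 63 - Long.numberOfLeadsZerosFixed(n);
    }

    static int estimateIndex(long n, int bucketCount)
    {
        if (n <= 1)
            return 0;
        int estimate = (int) (fastLog2(n) / LOG2_1_2);
        // Estimates at or past the last bucket return bucketCount directly,
        // saving the array access entirely in the overflow case.
        return Math.min(estimate, bucketCount);
    }

    // Small shim so the sketch is self-contained and obviously stdlib-only.
    private static int numberOfLeadsZerosFixed(long n)
    {
        return Long.numberOfLeadingZeros(n);
    }
}
```

A caller would then scan linearly from {{estimateIndex(n, bucketOffsets.length)}} to find the exact bucket, which for a geometric bucket layout is only ever a step or two away from the estimate.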