[
https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012511#comment-17012511
]
Jordan West edited comment on CASSANDRA-15213 at 1/10/20 4:14 PM:
------------------------------------------------------------------
I've updated [https://github.com/jrwest/cassandra/tree/jwest/15213] with the
changes you suggested. Regarding the extra bucket, I realized that, even more
simply, we could return {{bucketOffsets.length}} if the estimate is greater than
or equal to it, saving us an array access in that case altogether.
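Roughly, the lookup with that shortcut looks like the following. This is only an
illustrative sketch (names and constants are mine, not the code on the branch);
it assumes the offsets are roughly geometric with ratio 1.2 starting near 1, so
the estimate can come from a single {{Long.numberOfLeadingZeros}} with no memory
access:
{code:java}
// Illustrative sketch only -- names and constants are mine, not the branch.
// Assumes bucketOffsets is sorted, starts near 1, and grows by ~1.2x per bucket,
// so the true index is approximately log_1.2(value) = log2(value) / log2(1.2).
final class BucketIndexSketch
{
    private static final double LOG2_TO_LOG1_2 = Math.log(2.0) / Math.log(1.2); // ~3.8

    static int findIndex(long[] bucketOffsets, long value)
    {
        if (value <= bucketOffsets[0])
            return 0;

        // floor(log2(value)) from a single intrinsic -- no memory access
        int log2 = 63 - Long.numberOfLeadingZeros(value);

        // subtract one so the flooring errs low rather than high: under the
        // geometric assumption the estimate never lands past the true bucket
        int estimate = Math.max(0, (int) (log2 * LOG2_TO_LOG1_2) - 1);

        // the shortcut described above: an estimate at or past the last offset
        // means the overflow bucket, so return without touching the array at all
        if (estimate >= bucketOffsets.length)
            return bucketOffsets.length;

        // otherwise the estimate is at most a few buckets low; finish with a
        // short forward linear search instead of a binary search
        while (estimate < bucketOffsets.length && value > bucketOffsets[estimate])
            estimate++;
        return estimate;
    }
}
{code}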
Also added a very simple striping approach -- without any attempt at
distributing the buckets more uniformly. The performance is about the same (the
runs are within each other's margin of error) and still significantly better
than before CASSANDRA-14281. From the memory consumption perspective, however,
4 stripes ends up allocating more than 12MiB after startup. This isn't
surprising: with a single stripe (or pre-CASSANDRA-14281) the footprint after
startup was a little more than 3MiB, and we're quadrupling the size of each
array. However, these won't grow under contention like {{LongAdder[]}} would:
memory consumption is on the order of the number of histograms created, not a
function of both the number created and the contention, as before. I've found
comparable performance using 2 stripes on my 4 core machine. I have an 8 core
machine available that I can test with tomorrow. If the memory consumption is
still a concern I can investigate on-demand striping, but I prefer the simpler
approach and think the trade-off of 2-4 stripes is reasonable.
{code:java}
3.0 (4 core):
    [java] Benchmark                                      Mode  Cnt          Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5    5848504.963 ±  147881.511  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5    1946154.537 ±  623185.290  ops/s

Trunk (4 core):
    [java] Benchmark                                      Mode  Cnt          Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5   21887453.678 ± 2108481.285  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5    8908817.316 ±  115453.642  ops/s

15213 (4 core, linear search changes only):
    [java] Benchmark                                      Mode  Cnt          Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5   24646022.304 ± 1052105.818  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5    9175928.594 ±  269984.204  ops/s

15213 (4 core, 4 stripes, striping and linear search):
    [java] Benchmark                                      Mode  Cnt          Score         Error  Units
    [java] LatencyTrackingBench.benchInsertToDEHR        thrpt    5   18818181.576 ±  506997.366  ops/s
    [java] LatencyTrackingBench.benchLatencyMetricsWrite thrpt    5    8895569.814 ±  154219.113  ops/s
{code}
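For reference, the striping is roughly the following shape -- an illustrative
sketch rather than the code on the branch: the buckets are replicated across a
small fixed number of stripes in one flat {{AtomicLongArray}}, writers pick a
stripe from their thread id, and readers sum across the stripes. The allocation
happens once at construction, which is why memory scales with the number of
histograms rather than with contention:
{code:java}
import java.util.concurrent.atomic.AtomicLongArray;

// Very rough sketch of the "simple striping" described above; names are mine.
// Memory is fixed at construction (nStripes * nBuckets longs) regardless of
// contention, unlike a LongAdder[] which grows its cells under contention.
final class StripedBucketsSketch
{
    private final int nStripes;
    private final int nBuckets;
    private final AtomicLongArray counts; // nStripes contiguous copies of the buckets

    StripedBucketsSketch(int nStripes, int nBuckets)
    {
        this.nStripes = nStripes;
        this.nBuckets = nBuckets;
        this.counts = new AtomicLongArray(nStripes * nBuckets);
    }

    void increment(int bucket)
    {
        // no attempt to distribute threads uniformly, as noted above: just pick
        // a stripe from the current thread's id
        int stripe = (int) (Thread.currentThread().getId() % nStripes);
        counts.incrementAndGet(stripe * nBuckets + bucket);
    }

    long count(int bucket)
    {
        // readers reassemble a bucket's total by summing it across all stripes
        long sum = 0;
        for (int stripe = 0; stripe < nStripes; stripe++)
            sum += counts.get(stripe * nBuckets + bucket);
        return sum;
    }
}
{code}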
> DecayingEstimatedHistogramReservoir Inefficiencies
> --------------------------------------------------
>
> Key: CASSANDRA-15213
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15213
> Project: Cassandra
> Issue Type: Bug
> Components: Observability/Metrics
> Reporter: Benedict Elliott Smith
> Assignee: Jordan West
> Priority: Normal
> Fix For: 4.0-beta
>
>
> * The {{LongAdder}} introduced to trunk consumes 9MiB of heap without user
> schemas, and this will grow significantly under contention and with user
> schemas containing many tables. This is because {{LongAdder}} is a very heavy
> class designed for a single contended value.
> ** This can likely be improved significantly, without significant loss of
> performance in the contended case, by simply increasing the size of our
> primitive backing array and providing multiple buckets, with each thread
> picking a bucket to increment, or simply multiple backing arrays. Probably a
> better way still to do this would be to introduce some competition detection
> to the update, much like {{LongAdder}} utilises, that increases the number of
> backing arrays under competition.
> ** To save memory this approach could partition the space into chunks that
> are likely to be updated together, so that we do not need to duplicate the
> entire array under competition.
> * Similarly, binary search is costly, accounting for a measurable share of the
> new networking work (without filtering it was > 10% of the CPU used overall).
> We can compute an approximation floor(log2 n / log2 1.2) extremely cheaply,
> to save the random memory access costs.