Yuqi Yan created CASSANDRA-20158:
------------------------------------
Summary: IntervalTree should support updateIntervals for
checkpoint when ranges are unchanged
Key: CASSANDRA-20158
URL: https://issues.apache.org/jira/browse/CASSANDRA-20158
Project: Apache Cassandra
Issue Type: Improvement
Reporter: Yuqi Yan
Attachments: image-2024-12-20-02-39-53-420.png,
image-2024-12-20-02-41-06-544.png, image-2024-12-20-02-42-52-003.png
We observed very slow compaction and sometimes stuck memtable flushing hence
caused write latency spikes when the cluster has large number of SSTables
(~20K), similar to what was observed in CASSANDRA-19596.
Looking deeper into when these interval tree is rebuilt - there is actually no
need to do rebuild all the time for checkpoint() calls.
updateLiveSet(toUpdate, staged.update) this is updating the current version of
SSTableReader within the View. However the update isn't always changing the
ranges of the SSTableReader (low, high).
One Example: * IndexSummaryRedistribution.adjustSamplingLevels()
**
SSTableReader replacement = sstable.cloneWithNewSummarySamplingLevel(cfs,
entry.newSamplingLevel);
** This is changing the Metadata only and the ranges are unchanged
Considering this, rebuilding the entire IntervalTree will not be required,
instead IntervalTree should support replacing these SSTableReader instead.
If we're rebuilding the tree, complexity is O(n(logn)^2) in current trunk as
we're repeating the O(nlogn) sort on every node creation, after CASSANDRA-19596
this will be O(nlogn), but with update supported, some of the updateLiveSet
calls can be optimized to O(m(logn)^2) where m is the number of SSTableReaders
we attempt to replace, which we have m << n (number of SSTables) in most cases.
This is achieved by # finding the node containing the SSTable (logn)
# binary search and replacing the SSTableReader from the node (logn)
# To support CAS update, 1 and 2 need to done by copying the path and
re-create the affected nodes on the path
The experiment I did was on a 2 rings setup, one with trunk and one with this
patch. ~15K SSTables (with LCS, sstable size was 50MB, single_uplevel enabled).
stress-test with 1:1 rw ratio.
Result shows that ~15% of the checkpoint calls don't necessarily need to
rebuild the tree.
!image-2024-12-20-02-41-06-544.png|width=803,height=133!
Compaction throughput (scanner read throughput) was increased from 130MB/s to
200MB/s
!image-2024-12-20-02-39-53-420.png|width=1378,height=136!
Checkpoint finish time reduced from mean ~1.5s to ~800ms
!image-2024-12-20-02-42-52-003.png|width=1100,height=186!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]