Yuqi Yan created CASSANDRA-20158:
------------------------------------

             Summary: IntervalTree should support updateIntervals for 
checkpoint when ranges are unchanged
                 Key: CASSANDRA-20158
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20158
             Project: Apache Cassandra
          Issue Type: Improvement
            Reporter: Yuqi Yan
         Attachments: image-2024-12-20-02-39-53-420.png, 
image-2024-12-20-02-41-06-544.png, image-2024-12-20-02-42-52-003.png

We observed very slow compaction and occasionally stuck memtable flushes, which 
caused write latency spikes when the cluster has a large number of SSTables 
(~20K), similar to what was observed in CASSANDRA-19596.

Looking deeper into when the interval tree is rebuilt, there is actually no 
need to rebuild it on every checkpoint() call.
 
updateLiveSet(toUpdate, staged.update) updates the current version of an 
SSTableReader within the View. However, the update does not always change the 
range (low, high) of the SSTableReader.
 
One example:
 * IndexSummaryRedistribution.adjustSamplingLevels()
 ** SSTableReader replacement = sstable.cloneWithNewSummarySamplingLevel(cfs, entry.newSamplingLevel);
 ** This changes the metadata only; the ranges are unchanged
 
Considering this, rebuilding the entire IntervalTree is not required; instead, 
IntervalTree should support replacing these SSTableReaders in place.
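As a minimal sketch of the idea, the following checks whether every staged replacement keeps its reader's (first, last) range unchanged, so the tree can be patched instead of rebuilt. The names (TokenRange, Replacement, rangesUnchanged) are illustrative, not Cassandra's actual API:

```java
import java.util.List;
import java.util.Objects;

public class UpdateCheck {
    // Hypothetical stand-ins for an SSTableReader's covered token range
    // before and after a replacement (e.g. a summary-only clone).
    record TokenRange(long first, long last) {}
    record Replacement(TokenRange before, TokenRange after) {}

    /** True when an in-place replace suffices: no interval changed. */
    static boolean rangesUnchanged(List<Replacement> updates) {
        return updates.stream()
                      .allMatch(r -> Objects.equals(r.before(), r.after()));
    }

    public static void main(String[] args) {
        TokenRange r = new TokenRange(0, 100);
        // Summary-only clone: same range, so the tree can be patched in place.
        System.out.println(rangesUnchanged(List.of(new Replacement(r, new TokenRange(0, 100)))));
        // A replacement with a different range still forces a rebuild.
        System.out.println(rangesUnchanged(List.of(new Replacement(r, new TokenRange(0, 150)))));
    }
}
```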
 
If we rebuild the tree, the complexity is O(n(logn)^2) in current trunk, as we 
repeat the O(nlogn) sort on every node creation; after CASSANDRA-19596 this 
becomes O(nlogn). With update supported, some of the updateLiveSet calls can 
instead run in O(m(logn)^2), where m is the number of SSTableReaders we attempt 
to replace, and m << n (the number of SSTables) in most cases.
 
This is achieved by:
 # finding the node containing the SSTable (logn)
 # binary searching for and replacing the SSTableReader within the node (logn)
 # to support CAS updates, steps 1 and 2 must be done by copying the path and 
re-creating the affected nodes along it
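The steps above can be sketched with a toy path-copying replace, assuming a centered interval-tree layout where each immutable node stores the intervals crossing its center in a sorted array. Replacing a payload whose range is unchanged re-creates only the O(logn) nodes on the search path and shares every untouched subtree, so the new root can be swapped in with a CAS. All names here (Node, replace) are illustrative, not Cassandra's IntervalTree API:

```java
import java.util.Arrays;

public class PathCopyReplace {
    /** Immutable node: center point, sorted low endpoints crossing it, payloads, children. */
    record Node(long center, long[] lows, String[] payloads, Node left, Node right) {}

    /** Returns a new root with oldPayload replaced; untouched subtrees are shared. */
    static Node replace(Node n, long low, String oldPayload, String newPayload) {
        if (n == null) return null;
        int i = Arrays.binarySearch(n.lows(), low);      // step 2: O(logn) within the node
        if (i >= 0 && n.payloads()[i].equals(oldPayload)) {
            String[] copy = n.payloads().clone();        // copy this node's payloads only
            copy[i] = newPayload;
            return new Node(n.center(), n.lows(), copy, n.left(), n.right());
        }
        // Step 1/3: descend one side, re-creating just the nodes on the path.
        if (low < n.center())
            return new Node(n.center(), n.lows(), n.payloads(),
                            replace(n.left(), low, oldPayload, newPayload), n.right());
        return new Node(n.center(), n.lows(), n.payloads(),
                        n.left(), replace(n.right(), low, oldPayload, newPayload));
    }

    public static void main(String[] args) {
        Node leaf  = new Node(10, new long[]{8, 12}, new String[]{"a", "b"}, null, null);
        Node rleaf = new Node(80, new long[]{75},    new String[]{"d"},      null, null);
        Node root  = new Node(50, new long[]{40},    new String[]{"c"},      leaf, rleaf);
        Node updated = replace(root, 8, "a", "a2");
        System.out.println(updated.left().payloads()[0]);    // the replaced payload
        System.out.println(updated.right() == root.right()); // right subtree is shared
    }
}
```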

 
The experiment was run on a two-ring setup, one ring with trunk and one with 
this patch: ~15K SSTables (LCS, 50MB SSTable size, single_uplevel enabled), 
stress-tested with a 1:1 read/write ratio.

Results show that ~15% of the checkpoint calls do not need to rebuild the tree.

!image-2024-12-20-02-41-06-544.png|width=803,height=133!

Compaction throughput (scanner read throughput) increased from 130MB/s to 
200MB/s.

!image-2024-12-20-02-39-53-420.png|width=1378,height=136!

Mean checkpoint finish time dropped from ~1.5s to ~800ms.

!image-2024-12-20-02-42-52-003.png|width=1100,height=186!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
