[ 
https://issues.apache.org/jira/browse/CASSANDRA-20158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuqi Yan updated CASSANDRA-20158:
---------------------------------
    Description: 
We observed very slow compaction and occasionally stuck memtable flushing, which 
caused write latency spikes, when the cluster has a large number of SSTables 
(~20K), similar to what was observed in CASSANDRA-19596.

Looking deeper into when the interval tree is rebuilt - there is actually no 
need to rebuild it on every checkpoint() call.
 
updateLiveSet(toUpdate, staged.update) updates the current version of each 
SSTableReader within the View. However, the update does not always change the 
SSTableReader's range (low, high).
 
One example:
 * IndexSummaryRedistribution.adjustSamplingLevels()
 ** SSTableReader replacement = sstable.cloneWithNewSummarySamplingLevel(cfs, 
entry.newSamplingLevel);
 ** This changes only the metadata; the ranges are unchanged.
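
A minimal sketch of this pattern (Reader and cloneWithSamplingLevel are 
hypothetical stand-ins for SSTableReader and cloneWithNewSummarySamplingLevel, 
not Cassandra's actual API): the (low, high) range the interval tree indexes on 
is carried over unchanged, so replacing the reader cannot change the tree's shape.

```java
// Hypothetical stand-in for SSTableReader: an immutable (low, high) range
// plus a piece of metadata that can change between versions.
record Reader(int low, int high, int samplingLevel) {
    // Metadata-only clone: bounds are preserved, only the sampling level changes.
    Reader cloneWithSamplingLevel(int newLevel) {
        return new Reader(low, high, newLevel);
    }

    public static void main(String[] args) {
        Reader r = new Reader(0, 100, 128);
        Reader c = r.cloneWithSamplingLevel(64);
        assert c.low() == r.low() && c.high() == r.high(); // interval unchanged
        assert c.samplingLevel() == 64;                    // only metadata changed
        System.out.println("ok");
    }
}
```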
 

Considering this, rebuilding the entire IntervalTree is not required; instead, 
IntervalTree should support replacing these SSTableReaders in place.
 
If we rebuild the tree, the complexity is O(n(logn)^2) on current trunk, since 
the O(nlogn) sort is repeated on every node creation; after CASSANDRA-19596 it 
becomes O(nlogn). With in-place updates supported, some of the updateLiveSet 
calls can instead run in O(m(logn)^2), where m is the number of SSTableReaders 
being replaced and m << n (the total number of SSTables) in most cases.
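
As a back-of-envelope check of these bounds (assuming n = 15,000 SSTables and 
m = 10 replaced readers; the class and method names below are illustrative, not 
from Cassandra):

```java
// Back-of-envelope comparison of the three asymptotic costs quoted above,
// with illustrative constants dropped.
final class RebuildCost {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static double rebuildTrunk(int n)       { return n * log2(n) * log2(n); }  // O(n (log n)^2)
    static double rebuildWith19596(int n)   { return n * log2(n); }            // O(n log n)
    static double replaceOnly(int m, int n) { return m * log2(n) * log2(n); }  // O(m (log n)^2)

    public static void main(String[] args) {
        int n = 15_000, m = 10;
        // Roughly 2.9M vs 208K vs 1.9K "units" of work: the replace path is
        // about two orders of magnitude cheaper than even the 19596 rebuild.
        assert replaceOnly(m, n) < rebuildWith19596(n);
        assert rebuildWith19596(n) < rebuildTrunk(n);
        System.out.printf("trunk=%.0f 19596=%.0f replace=%.0f%n",
                          rebuildTrunk(n), rebuildWith19596(n), replaceOnly(m, n));
    }
}
```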
 
This is achieved by:
 # finding the node containing the SSTable (logn)
 # binary-searching for and replacing the SSTableReader within that node (logn)
 # to support CAS updates, performing steps 1 and 2 by copying the search path 
and re-creating the affected nodes along it
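
The three steps above can be sketched as a persistent (copy-on-write) centered 
interval tree. This is a minimal illustration, not Cassandra's IntervalTree 
implementation; all names are hypothetical, and it assumes each node's item 
list has unique low bounds.

```java
import java.util.Arrays;

// Hypothetical element with an immutable [low, high] range and a payload
// standing in for an SSTableReader's metadata.
record Item(int low, int high, String payload) {}

// One node of a centered interval tree: the items overlapping `center`,
// kept sorted by their low bound, plus left and right subtrees.
record Node(int center, Item[] byLow, Node left, Node right) {}

final class IntervalTreeSketch {
    /**
     * Returns a new tree in which the item with the same (low, high) range as
     * `replacement` is swapped for `replacement`. Only the O(log n) nodes on
     * the search path are re-created; every untouched subtree and item list is
     * shared with the old tree, which is what makes a CAS-style swap of the
     * root reference safe.
     */
    static Node replace(Node node, Item replacement) {
        if (node == null)
            throw new IllegalStateException("range not found: " + replacement);
        if (replacement.high() < node.center())   // entirely left of this center
            return new Node(node.center(), node.byLow(), replace(node.left(), replacement), node.right());
        if (replacement.low() > node.center())    // entirely right of this center
            return new Node(node.center(), node.byLow(), node.left(), replace(node.right(), replacement));
        // Step 2: the range overlaps this node's center, so binary-search the
        // sorted-by-low list for it.
        int i = Arrays.binarySearch(node.byLow(), replacement,
                                    (a, b) -> Integer.compare(a.low(), b.low()));
        if (i < 0 || node.byLow()[i].high() != replacement.high())
            throw new IllegalStateException("range not found: " + replacement);
        Item[] copy = node.byLow().clone();       // copy only this node's list
        copy[i] = replacement;
        return new Node(node.center(), copy, node.left(), node.right());
    }

    public static void main(String[] args) {
        Node leaf = new Node(5, new Item[]{ new Item(3, 7, "v1") }, null, null);
        Node root = new Node(10, new Item[]{ new Item(8, 12, "a") }, leaf, null);

        Node updated = replace(root, new Item(3, 7, "v2"));

        assert updated.byLow() == root.byLow();                  // root's list shared
        assert "v2".equals(updated.left().byLow()[0].payload()); // replacement visible
        assert "v1".equals(root.left().byLow()[0].payload());    // old tree untouched
        System.out.println("ok");
    }
}
```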

 
The experiment ran on a two-ring setup: one ring on trunk with 
CASSANDRA-19596, the other with CASSANDRA-19596 plus this patch. ~15K SSTables 
(LCS, 50MB SSTable size, single_uplevel enabled), stress-tested with a 1:1 
read/write ratio.

The results show that ~15% of the checkpoint calls do not actually need to 
rebuild the tree.

!image-2024-12-20-02-41-06-544.png|width=803,height=133!

Compaction throughput (scanner read throughput) increased from 130MB/s to 
200MB/s.

!image-2024-12-20-02-39-53-420.png|width=1378,height=136!

Mean checkpoint finish time dropped from ~1.5s to ~800ms.

!image-2024-12-20-02-42-52-003.png|width=1100,height=186!


> IntervalTree should support updateIntervals for checkpoint when ranges are 
> unchanged
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20158
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20158
>             Project: Apache Cassandra
>          Issue Type: Improvement
>            Reporter: Yuqi Yan
>            Priority: Normal
>         Attachments: image-2024-12-20-02-39-53-420.png, 
> image-2024-12-20-02-41-06-544.png, image-2024-12-20-02-42-52-003.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
