[
https://issues.apache.org/jira/browse/CASSANDRA-20158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuqi Yan updated CASSANDRA-20158:
---------------------------------
Description:
We observed very slow compaction and occasionally stuck memtable flushing, which
caused write latency spikes when the cluster had a large number of SSTables
(~20K), similar to what was observed in CASSANDRA-19596.
Looking deeper into when the interval tree is rebuilt, there is actually no
need to rebuild it on every checkpoint() call.
updateLiveSet(toUpdate, staged.update) updates the current version of an
SSTableReader within the View. However, the update does not always change the
SSTableReader's range (low, high).
One example:
* IndexSummaryRedistribution.adjustSamplingLevels()
** SSTableReader replacement = sstable.cloneWithNewSummarySamplingLevel(cfs,
entry.newSamplingLevel);
** This changes only the metadata; the ranges are unchanged
Considering this, rebuilding the entire IntervalTree is not required; instead,
IntervalTree should support replacing these SSTableReaders in place.
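A minimal sketch of the decision this implies (all names here are hypothetical stand-ins, not Cassandra's actual View/Tracker API): compare each replacement's bounds with the reader it replaces, and only fall back to a full rebuild when a bound actually moved.

```java
// Hypothetical sketch of the checkpoint decision: if a replacement keeps its
// predecessor's (low, high) bounds, the interval tree can be patched in place;
// otherwise a full rebuild is still needed. Reader is a stand-in for
// SSTableReader; none of these names are Cassandra's real API.
final class CheckpointSketch {
    // Only the fields the decision needs.
    record Reader(long low, long high, int generation) {}

    /** True iff replacing 'oldReader' with 'newReader' keeps the interval
     *  bounds, so the tree node can be patched instead of rebuilt. */
    static boolean canUpdateInPlace(Reader oldReader, Reader newReader) {
        return oldReader.low() == newReader.low()
            && oldReader.high() == newReader.high();
    }
}
```

Under this sketch, the IndexSummaryRedistribution example above would always take the in-place path, since cloneWithNewSummarySamplingLevel changes metadata only.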
If we rebuild the tree, the complexity is O(n(logn)^2) in current trunk, since
we repeat the O(nlogn) sort on every node creation; after CASSANDRA-19596 it
becomes O(nlogn). With in-place update supported, some of the updateLiveSet
calls can instead be done in O(m(logn)^2), where m is the number of
SSTableReaders we attempt to replace, and m << n (the number of SSTables) in
most cases.
This is achieved by:
# finding the node containing the SSTable (logn)
# binary-searching for and replacing the SSTableReader within the node (logn)
# to support CAS updates, steps 1 and 2 are done by copying the path and
re-creating the affected nodes along it
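The path-copying idea in the steps above can be sketched as follows. This is a deliberately simplified model, not Cassandra's IntervalTree: nodes are keyed directly by (low, high) bounds and carry an opaque payload standing in for the SSTableReader. Because the bounds are unchanged, a replace only re-creates the O(logn) nodes on the search path; everything off the path is shared, so the new root can be published with a single CAS.

```java
// Persistent (immutable, path-copying) tree sketch: replace() rebuilds only
// the nodes between the root and the target, sharing all other subtrees.
final class PersistentIntervalTree {
    static final class Node {
        final long low, high;   // interval bounds; unchanged by replace()
        final Object payload;   // stand-in for SSTableReader
        final Node left, right;
        Node(long low, long high, Object payload, Node left, Node right) {
            this.low = low; this.high = high;
            this.payload = payload; this.left = left; this.right = right;
        }
    }

    final Node root;
    PersistentIntervalTree(Node root) { this.root = root; }

    PersistentIntervalTree insert(long low, long high, Object payload) {
        return new PersistentIntervalTree(insert(root, low, high, payload));
    }

    private static Node insert(Node n, long low, long high, Object payload) {
        if (n == null) return new Node(low, high, payload, null, null);
        int cmp = compare(low, high, n);
        if (cmp < 0) return new Node(n.low, n.high, n.payload,
                                     insert(n.left, low, high, payload), n.right);
        if (cmp > 0) return new Node(n.low, n.high, n.payload,
                                     n.left, insert(n.right, low, high, payload));
        return new Node(low, high, payload, n.left, n.right);
    }

    /** Replace the payload of the node with exactly these bounds. Copies only
     *  the search path (O(logn) new nodes); returns this tree if not found. */
    PersistentIntervalTree replace(long low, long high, Object newPayload) {
        Node newRoot = replace(root, low, high, newPayload);
        return newRoot == null ? this : new PersistentIntervalTree(newRoot);
    }

    private static Node replace(Node n, long low, long high, Object payload) {
        if (n == null) return null;            // interval not present
        int cmp = compare(low, high, n);
        if (cmp == 0) return new Node(n.low, n.high, payload, n.left, n.right);
        Node child = cmp < 0 ? replace(n.left, low, high, payload)
                             : replace(n.right, low, high, payload);
        if (child == null) return null;
        return cmp < 0 ? new Node(n.low, n.high, n.payload, child, n.right)
                       : new Node(n.low, n.high, n.payload, n.left, child);
    }

    Object get(long low, long high) {
        Node n = root;
        while (n != null) {
            int cmp = compare(low, high, n);
            if (cmp == 0) return n.payload;
            n = cmp < 0 ? n.left : n.right;
        }
        return null;
    }

    private static int compare(long low, long high, Node n) {
        int c = Long.compare(low, n.low);
        return c != 0 ? c : Long.compare(high, n.high);
    }
}
```

Since the old root remains valid and fully usable, concurrent readers never see a partially updated tree; the writer simply CASes the View's reference from the old root to the new one.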
The experiment was done on a two-ring setup: one ring on 4.1 + CASSANDRA-19596
(marked as trunk), the other on 4.1 + CASSANDRA-19596 + this patch (marked as
new). Each had ~15K SSTables (LCS, 50MB SSTable size, single_uplevel enabled),
under a stress test with a 1:1 read/write ratio.
Results show that ~15% of the checkpoint calls did not actually need to
rebuild the tree.
!image-2024-12-20-02-41-06-544.png|width=803,height=133!
Compaction throughput (scanner read throughput) increased from 130MB/s to
200MB/s.
!image-2024-12-20-02-39-53-420.png|width=1378,height=136!
Mean checkpoint finish time dropped from ~1.5s to ~800ms.
!image-2024-12-20-02-42-52-003.png|width=1100,height=186!
> IntervalTree should support updateIntervals for checkpoint when ranges are
> unchanged
> ------------------------------------------------------------------------------------
>
> Key: CASSANDRA-20158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20158
> Project: Apache Cassandra
> Issue Type: Improvement
> Reporter: Yuqi Yan
> Priority: Normal
> Attachments: image-2024-12-20-02-39-53-420.png,
> image-2024-12-20-02-41-06-544.png, image-2024-12-20-02-42-52-003.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]