[
https://issues.apache.org/jira/browse/CASSANDRA-20158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936591#comment-17936591
]
Yuqi Yan commented on CASSANDRA-20158:
--------------------------------------
[~aweisberg] Correct. It was done with the default preemptive opening interval,
which was 50MB I believe. At this point I have a feeling that the performance
improvement came largely from creating fewer objects (Interval, IntervalNode,
etc.), but I haven't had time to do more profiling there. The trunk version of
CASSANDRA-19596 saves time by creating fewer `Interval` objects, and this
replace/add can additionally save time by creating fewer `IntervalNode` objects.
> I have a version of add and replace that is integrated with the recent trunk
> change that requires sorted arrays
That's awesome. I spent some time last week on the trunk version - in trunk
we now have a unique instanceId for each SSTableReader. The added assumption is
that the min/maxOrdering arrays must be sorted by instanceId when multiple
SSTableReaders share the same first/last. `Interval.minOrdering()` and
`Interval.maxOrdering()` will now binary-search in (min/max, instanceId)
order as well.
I think the add method won't have much of an issue, but the replace method
breaks the instanceId ordering.
To fix that we can reuse `buildUpdatedArray` to construct the new
min/maxOrdering arrays, just providing the replacementMap's keys/values as
removes/adds. When building the partially updated tree we'll need to sort the
updated intersect list by (min/max, instanceId) at the leaf nodes.
(Just curious: why do we have to sort by instanceId here?)
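To make the tie-breaking concrete, here is a minimal sketch (hypothetical `Entry` type and `search` helper, not the actual Cassandra `Interval`/`SSTableReader` code) of ordering by (min, instanceId) and binary-searching that order, so that entries sharing the same min remain deterministically sorted:

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical stand-in for an interval keyed by (min, instanceId);
// the real Cassandra types carry tokens and readers instead of ints.
public class MinOrderingSketch {
    public record Entry(int min, int instanceId) {}

    // Order by min first, then break ties by instanceId so the array
    // has a single deterministic order even when mins collide.
    public static final Comparator<Entry> MIN_ORDERING =
            Comparator.comparingInt(Entry::min).thenComparingInt(Entry::instanceId);

    // Binary search on (min, instanceId); returns the index of the entry,
    // or a negative insertion point, as with Arrays.binarySearch.
    public static int search(Entry[] sorted, Entry key) {
        return Arrays.binarySearch(sorted, key, MIN_ORDERING);
    }

    public static void main(String[] args) {
        Entry[] entries = {
            new Entry(1, 7), new Entry(5, 2), new Entry(5, 9), new Entry(8, 1)
        };
        Arrays.sort(entries, MIN_ORDERING);
        // Two entries share min == 5; instanceId disambiguates them.
        assert search(entries, new Entry(5, 2)) == 1;
        assert search(entries, new Entry(5, 9)) == 2;
        System.out.println("ok");
    }
}
```

A replace that reuses an existing slot without re-sorting can violate exactly this invariant, which is why the updated intersect lists need a re-sort at the leaves.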
> IntervalTree should support copyAndReplace for checkpoint when ranges are
> unchanged
> -----------------------------------------------------------------------------------
>
> Key: CASSANDRA-20158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20158
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Compaction, Local/SSTable
> Reporter: Yuqi Yan
> Assignee: Yuqi Yan
> Priority: Normal
> Fix For: 4.1.x
>
> Attachments: image-2024-12-20-02-39-53-420.png,
> image-2024-12-20-02-41-06-544.png, image-2024-12-20-02-42-52-003.png
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> We observed very slow compaction and occasionally stuck memtable flushes,
> which caused write-latency spikes, when the cluster had a large number of
> SSTables (~20K), similar to what was observed in CASSANDRA-19596.
> Looking deeper into when the interval tree is rebuilt - there is actually
> no need to rebuild it on every checkpoint() call.
>
> updateLiveSet(toUpdate, staged.update) updates the current version of the
> SSTableReaders within the View. However, the update does not always change
> the ranges (low, high) of the SSTableReaders.
>
> One example:
> * IndexSummaryRedistribution.adjustSamplingLevels()
> ** SSTableReader replacement = sstable.cloneWithNewSummarySamplingLevel(cfs,
> entry.newSamplingLevel);
> ** This changes only the metadata; the ranges are unchanged.
>
> Given this, rebuilding the entire IntervalTree is not required; instead,
> IntervalTree should support replacing these SSTableReaders in place.
>
> If we rebuild the tree, the complexity is O(n(log n)^2) in current trunk,
> as we repeat the O(n log n) sort on every node creation; after
> CASSANDRA-19596 this becomes O(n log n). With update supported, some of the
> updateLiveSet calls can be optimized to O(m(log n)^2), where m is the number
> of SSTableReaders we attempt to replace, and m << n (the number of SSTables)
> in most cases.
>
> This is achieved by:
> # finding the node containing the SSTable (log n)
> # binary-searching for and replacing the SSTableReader within the node (log n)
> # to support CAS updates, steps 1 and 2 need to be done by copying the path
> and re-creating the affected nodes on it
>
> The experiment was done on a two-ring setup: one ring on
> 4.1+CASSANDRA-19596 (marked as trunk) and the other on
> 4.1+CASSANDRA-19596+this patch (marked as new), with ~15K SSTables (LCS,
> 50MB SSTable size, single_uplevel enabled), stress-tested at a 1:1
> read/write ratio.
> The results show that ~15% of the checkpoint calls don't actually need to
> rebuild the tree.
> !image-2024-12-20-02-41-06-544.png|width=803,height=133!
> Compaction throughput (scanner read throughput) increased from 130MB/s to
> 200MB/s.
> !image-2024-12-20-02-39-53-420.png|width=1378,height=136!
> Checkpoint finish time dropped from a mean of ~1.5s to ~800ms.
> !image-2024-12-20-02-42-52-003.png|width=1100,height=186!
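The copy-and-replace steps in the quoted ticket can be sketched as a path-copying update on an immutable binary tree. This is a hypothetical minimal structure (plain int keys, no interval bookkeeping), not Cassandra's actual IntervalTree, but it shows why only the O(log n) nodes on the search path are re-created while everything else is shared, which is what makes the result safe to swap in with a CAS:

```java
// Minimal persistent BST sketch: replacing a value copies only the
// O(log n) nodes on the root-to-target path; all other subtrees are
// shared between the old and new tree (hypothetical, not Cassandra code).
public final class PathCopySketch {
    public record Node(int key, String value, Node left, Node right) {}

    // Returns a new root with the value at 'key' replaced, leaving the
    // original tree untouched (safe for CAS-style swaps of the live view).
    public static Node replace(Node n, int key, String newValue) {
        if (n == null) throw new IllegalArgumentException("key not found: " + key);
        if (key < n.key) return new Node(n.key, n.value, replace(n.left, key, newValue), n.right);
        if (key > n.key) return new Node(n.key, n.value, n.left, replace(n.right, key, newValue));
        return new Node(n.key, newValue, n.left, n.right); // key (range) unchanged
    }

    public static void main(String[] args) {
        Node leaf = new Node(1, "a", null, null);
        Node root = new Node(2, "b", leaf, new Node(3, "c", null, null));
        Node updated = replace(root, 3, "c2");
        assert updated.right().value().equals("c2"); // new tree sees the update
        assert root.right().value().equals("c");     // old tree is unchanged
        assert updated.left() == root.left();        // untouched subtree is shared
        System.out.println("ok");
    }
}
```

The metadata-only replacements above (e.g. cloneWithNewSummarySamplingLevel) correspond to the `key unchanged` branch: the value changes but the node's range does not, so no re-balancing or full rebuild is needed.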
--
This message was sent by Atlassian Jira
(v8.20.10#820010)