[
https://issues.apache.org/jira/browse/CASSANDRA-20159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17907969#comment-17907969
]
Yuqi Yan commented on CASSANDRA-20159:
--------------------------------------
Attached the patch for proposal 2 in CASSANDRA-20164. Happy to share the
metrics if anyone wants to do similar testing for these patches. The real-time
compaction metric was not available in 4.1, so I'm using the one from
CASSANDRA-13890
> memtable flush stuck for minutes on slow replaceFlushed view update
> -------------------------------------------------------------------
>
> Key: CASSANDRA-20159
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20159
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Compaction, Local/SSTable
> Reporter: Yuqi Yan
> Assignee: Yuqi Yan
> Priority: Normal
> Fix For: 4.1.x
>
> Attachments: image-2024-12-20-06-14-54-493.png,
> image-2024-12-20-06-39-04-194.png
>
>
> We observed slow memtable flushing, which caused write latency spikes, when the
> cluster had a large number of SSTables (~15K). Looking into the CPU profiling,
> it seems that most of the CPU was busy doing updates on the View from compaction.
> !image-2024-12-20-06-14-54-493.png|width=570,height=476!
> Taking a closer look - these checkpoint calls are mostly from
> maybeReopenEarly. I don't fully understand how this early-open mechanism
> works, but according to some recent investigation, the checkpoint calls can
> become expensive (as observed in CASSANDRA-19596, CASSANDRA-20158).
> The replaceFlushed update also requires an entire SSTableIntervalTree rebuild,
> which takes significantly longer as the number of SSTables grows:
>
> {code:java}
> static Function<View, View> replaceFlushed(final Memtable memtable, final Iterable<SSTableReader> flushed)
> {
>     return new Function<View, View>()
>     {
>         public View apply(View view)
>         {
>             List<Memtable> flushingMemtables = copyOf(filter(view.flushingMemtables, not(equalTo(memtable))));
>             assert flushingMemtables.size() == view.flushingMemtables.size() - 1;
>             if (flushed == null || Iterables.isEmpty(flushed))
>                 return new View(view.liveMemtables, flushingMemtables, view.sstablesMap,
>                                 view.compactingMap, view.intervalTree);
>             Map<SSTableReader, SSTableReader> sstableMap = replace(view.sstablesMap, emptySet(), flushed);
>             return new View(view.liveMemtables, flushingMemtables, sstableMap, view.compactingMap,
>                             SSTableIntervalTree.build(sstableMap.keySet()));
>         }
>     };
> } {code}
> When a node is busy with compaction, the {{replaceFlushed}} update can easily
> encounter contention. Assuming 1 thread doing the memtable flush and 6
> concurrent compactors running at full speed, these 7 threads take a similar
> time (say T ms) to generate a new View. There is only a 1/7 chance that the
> flush succeeds on each attempt, so the expected finish time is 7 * T ms...
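The 7 * T expectation is just a geometric distribution; a small standalone simulation (hypothetical model code, not Cassandra's), where one of 7 threads wins each View compare-and-swap round uniformly at random:

```java
import java.util.Random;

public class FlushContention
{
    // Model: each round, exactly one of n racing threads wins the View
    // compare-and-swap, uniformly at random; the flush thread must win once.
    // The number of rounds is geometric with p = 1/n, so E[rounds] = n and
    // the expected flush finish time is n * T.
    static int roundsUntilFlushWins(int threads, Random rng)
    {
        int rounds = 0;
        while (true)
        {
            rounds++;
            if (rng.nextInt(threads) == 0) // thread 0 is the flush thread
                return rounds;
        }
    }

    public static void main(String[] args)
    {
        Random rng = new Random(42);
        int trials = 100_000;
        long total = 0;
        for (int i = 0; i < trials; i++)
            total += roundsUntilFlushWins(7, rng);
        // Average is close to 7 rounds, i.e. expected finish time ~7 * T
        System.out.println((double) total / trials);
    }
}
```

Averaged over many trials, the flush needs about 7 rounds, matching E[rounds] = 1/p = 7 for p = 1/7.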
>
> What's worse, replaceFlushed is not only competing for View updates with the
> checkpoint calls from compaction. In some of my testing, I noticed a surge in
> unmarkCompacting calls every hour (from the default
> index_summary_resize_interval = 60m), lasting 2 minutes or longer. During this
> 2-minute window, replaceFlushed finishTime is significantly longer, and we see
> pending MutationStage tasks start to pile up (all of them waiting for the
> memtable to be flushed), so writes start to time out.
> !image-2024-12-20-06-39-04-194.png|width=1259,height=292!
> These unmarkCompacting calls were from
> IndexSummaryRedistribution.redistributeSummaries(). If I understand it
> correctly, what happens here is:
> # IndexSummaryManager marks all SSTables as compacting, which is done in
> one go per cfs by iterating all cfs and adding all their SSTables to the
> compactingMap
> # in adjustSamplingLevels(), it calculates the sampling level for each
> sstable (probably also doing some updates on the SSTableReader?)
> # it unmarks compacting for the SSTables that don't need downsampling
> Step 3 is done *one by one*, which is what causes trouble here:
> {code:java}
> if (remainingSpace > 0)
> {
>     Pair<List<SSTableReader>, List<ResampleEntry>> result = distributeRemainingSpace(toDownsample, remainingSpace);
>     toDownsample = result.right;
>     newSSTables.addAll(result.left);
>     for (SSTableReader sstable : result.left)
>         transactions.get(sstable.metadata().id).cancel(sstable);
> } {code}
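The per-sstable cancel above could instead be grouped per table, so each table's transaction unmarks its SSTables in one pass. A minimal, self-contained sketch of just the grouping step (the SSTable record and tableId field here are hypothetical stand-ins, not Cassandra's types):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchCancel
{
    // Hypothetical stand-in for an SSTableReader: just a table id and a name.
    record SSTable(String tableId, String name) {}

    // Group the sstables that no longer need downsampling by their table id,
    // so each table's transaction can cancel its whole group in a single
    // View update instead of one update per sstable.
    static Map<String, List<SSTable>> groupByTable(List<SSTable> toCancel)
    {
        return toCancel.stream().collect(Collectors.groupingBy(SSTable::tableId));
    }

    public static void main(String[] args)
    {
        List<SSTable> toCancel = List.of(
            new SSTable("t1", "a"), new SSTable("t1", "b"), new SSTable("t2", "c"));
        // Two grouped cancels instead of three individual ones.
        System.out.println(groupByTable(toCancel).size());
    }
}
```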
>
> Here are 2 proposals for improving this:
> # Fix redistributeSummaries() to group these SSTables and unmark compacting
> in one go
> # Make replaceFlushed faster, by supporting addSSTables in IntervalTree.
> replaceFlushed calls don't remove anything from the existing IntervalTree, so
> inserting the new intervals can also be fast (at least better than rebuilding
> the entire tree)
> ** The only concern here is that addSSTables might create an imbalanced tree,
> but since we rebuild the tree very frequently anyway, I don't think this is a
> huge concern
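A toy illustration of the addSSTables idea: insert into a sorted interval list with a binary search plus one shift, instead of rebuilding the whole structure (the Interval record is a hypothetical stand-in; Cassandra's SSTableIntervalTree is more involved):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class IntervalAdd
{
    // Hypothetical interval keyed by its min bound; this only illustrates
    // inserting without a full rebuild, not the real IntervalTree layout.
    record Interval(int min, int max) {}

    // Keep the list sorted by min: binary search finds the insertion point,
    // then one copy/shift inserts the new interval, rather than re-sorting
    // or rebuilding all n intervals.
    static List<Interval> addInterval(List<Interval> sorted, Interval added)
    {
        List<Interval> result = new ArrayList<>(sorted);
        int idx = Collections.binarySearch(
            result, added, Comparator.comparingInt(Interval::min));
        if (idx < 0)
            idx = -idx - 1; // binarySearch returns -(insertion point) - 1
        result.add(idx, added);
        return result;
    }

    public static void main(String[] args)
    {
        List<Interval> tree = List.of(new Interval(0, 5), new Interval(10, 20));
        List<Interval> updated = addInterval(tree, new Interval(6, 9));
        System.out.println(updated.get(1).min()); // 6
    }
}
```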
> Let me know what you think about this. I'm not sure why this cancel was done
> one by one; is it possible to do it in a batch grouped by cfs (just like how
> we mark them as compacting)?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)