Yuqi Yan created CASSANDRA-20159:
------------------------------------
Summary: memtable flush stuck for minutes on slow replaceFlushed view update
Key: CASSANDRA-20159
URL: https://issues.apache.org/jira/browse/CASSANDRA-20159
Project: Apache Cassandra
Issue Type: Improvement
Reporter: Yuqi Yan
Assignee: Yuqi Yan
Attachments: image-2024-12-20-06-14-54-493.png, image-2024-12-20-06-39-04-194.png
We observed slow memtable flushes, and the resulting write latency spikes, when the
cluster had a large number of SSTables (~15K). Looking into the CPU profiling, it
seems most of the CPU was busy doing updates on the View from compaction.
!image-2024-12-20-06-14-54-493.png|width=570,height=476!
Taking a closer look, these checkpoint calls are mostly from maybeReopenEarly. I
don't fully understand how this early-open mechanism works, but according to some
recent investigation the checkpoint calls can become expensive (observed in
CASSANDRA-19596 and CASSANDRA-20158).
The replaceFlushed update also requires rebuilding the entire SSTableIntervalTree,
which takes significantly longer as the number of SSTables grows:
{code:java}
static Function<View, View> replaceFlushed(final Memtable memtable, final Iterable<SSTableReader> flushed)
{
    return new Function<View, View>()
    {
        public View apply(View view)
        {
            List<Memtable> flushingMemtables = copyOf(filter(view.flushingMemtables, not(equalTo(memtable))));
            assert flushingMemtables.size() == view.flushingMemtables.size() - 1;

            if (flushed == null || Iterables.isEmpty(flushed))
                return new View(view.liveMemtables, flushingMemtables, view.sstablesMap,
                                view.compactingMap, view.intervalTree);

            Map<SSTableReader, SSTableReader> sstableMap = replace(view.sstablesMap, emptySet(), flushed);
            // rebuilds the entire interval tree from scratch, even though only a few SSTables were added
            return new View(view.liveMemtables, flushingMemtables, sstableMap, view.compactingMap,
                            SSTableIntervalTree.build(sstableMap.keySet()));
        }
    };
} {code}
When a node is busy with compaction, the {{replaceFlushed}} update can easily run
into contention: View updates are published through a compare-and-swap retry loop,
so a thread that loses the race must recompute its update against the latest View.
Assuming 1 thread doing a memtable flush and 6 concurrent compactors running at
full speed, with each of these 7 threads taking a similar time (say T ms) to
generate its new View, the flush has only a 1/7 chance of winning any given round,
so its expected finish time is about 7 * T ms.
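To make the retry cost concrete, here is a minimal sketch of that compare-and-swap
pattern, assuming Tracker-style publishing of the View through an AtomicReference
(the names here are simplified and hypothetical, not the actual Cassandra code):
{code:java}
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Minimal sketch: a View published via CAS. Every losing thread recomputes
// its update function from scratch against the latest View.
final class ViewTracker<V>
{
    private final AtomicReference<V> current;

    ViewTracker(V initial) { current = new AtomicReference<>(initial); }

    V apply(Function<V, V> update)
    {
        while (true)
        {
            V before = current.get();
            V after = update.apply(before); // e.g. replaceFlushed: full interval-tree rebuild
            if (current.compareAndSet(before, after))
                return after;
            // CAS failed: another thread (e.g. a compactor's checkpoint) won the
            // race, and the expensive 'after' View is thrown away.
        }
    }
}
{code}
Every lost round discards a fully built View, so a losing replaceFlushed pays for
the whole SSTableIntervalTree rebuild again.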
What's worse, replaceFlushed is not only competing for View updates with the
checkpoint calls from compaction. In some of my testing I noticed a surge in
unmarkCompacting calls every hour (from the default
index_summary_resize_interval = 60m), lasting 2 minutes or longer. During this
2-minute window the replaceFlushed finish time is significantly longer, pending
MutationStage tasks start to pile up (all of them waiting for the memtable to be
flushed), and writes start to time out.
!image-2024-12-20-06-39-04-194.png|width=1259,height=292!
These unmarkCompacting calls came from
IndexSummaryRedistribution.redistributeSummaries(). If I understand it correctly,
what happens here is:
# IndexSummaryManager marks all SSTables as compacting. This is done in one go
per cfs, by iterating all cfs and adding all of their SSTables to the compactingMap.
# In adjustSamplingLevels(), it calculates the sampling level for each SSTable
(probably also doing some updates on the SSTableReader?).
# It unmarks compacting for the SSTables that don't need downsampling.
Step 3 is done *one by one*, which is what causes trouble here:
{code:java}
if (remainingSpace > 0)
{
    Pair<List<SSTableReader>, List<ResampleEntry>> result = distributeRemainingSpace(toDownsample, remainingSpace);
    toDownsample = result.right;
    newSSTables.addAll(result.left);
    for (SSTableReader sstable : result.left)
        transactions.get(sstable.metadata().id).cancel(sstable); // one View update per SSTable
} {code}
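For contrast, here is a hypothetical batched variant at the View level (these
names are illustrative, not Cassandra's actual API): releasing N SSTables costs
one CAS round instead of N rounds contending with replaceFlushed.
{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: unmarking SSTables as compacting one by one versus
// in a single View transformation. SSTables are simplified to String names.
final class CompactingView
{
    final Set<String> compacting;

    CompactingView(Set<String> compacting) { this.compacting = compacting; }

    // one-by-one: N separate CAS rounds, each contending with replaceFlushed
    static CompactingView unmarkOne(CompactingView view, String sstable)
    {
        Set<String> next = new HashSet<>(view.compacting);
        next.remove(sstable);
        return new CompactingView(next);
    }

    // batched: a single CAS round no matter how many SSTables are released
    static CompactingView unmarkAll(CompactingView view, Set<String> sstables)
    {
        Set<String> next = new HashSet<>(view.compacting);
        next.removeAll(sstables);
        return new CompactingView(next);
    }
}
{code}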
Here I have 2 proposals for improving this:
# Fix redistributeSummaries() to group these SSTables and unmark compacting in
one go (along the lines of the batched sketch above).
# Make replaceFlushed faster, by supporting addSSTables on the IntervalTree (see
the sketch after this list). replaceFlushed never removes anything from the
existing IntervalTree, so inserting the new intervals into it can be fast (at
least faster than rebuilding the entire tree).
** The only concern here is that addSSTables might create an imbalanced tree, but
we rebuild the tree very frequently anyway, so I don't think this is a huge
concern.
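For Proposal 2, here is a minimal sketch of the insert-instead-of-rebuild idea,
using a classic augmented interval BST (this is not Cassandra's IntervalTree API;
token bounds are simplified to longs):
{code:java}
// Hypothetical sketch: inserting an interval in O(depth) instead of
// rebuilding the whole tree. No rebalancing; mutation is in-place for
// brevity, whereas an immutable View would need path-copying so the
// previous tree stays readable.
final class IntervalNode
{
    final long min, max;  // interval covered by one SSTable
    long subtreeMax;      // largest 'max' in this subtree, maintained on insert
    IntervalNode left, right;

    IntervalNode(long min, long max)
    {
        this.min = min;
        this.max = max;
        this.subtreeMax = max;
    }

    // Insert keyed on interval start, updating subtreeMax along the path.
    static IntervalNode insert(IntervalNode root, long min, long max)
    {
        if (root == null)
            return new IntervalNode(min, max);
        if (min < root.min)
            root.left = insert(root.left, min, max);
        else
            root.right = insert(root.right, min, max);
        root.subtreeMax = Math.max(root.subtreeMax, max);
        return root;
    }
}
{code}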
Let me know what you think about this. I have a patch for Proposal 2 and am
still working on Proposal 1.