[ 
https://issues.apache.org/jira/browse/CASSANDRA-20159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17907519#comment-17907519
 ] 

Yuqi Yan commented on CASSANDRA-20159:
--------------------------------------

Attached a 4.1 PR for proposal 1; will share the patch for proposal 2 later in a
separate ticket.

> memtable flush stuck for minutes on slow replaceFlushed view update
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-20159
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20159
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction, Local/SSTable
>            Reporter: Yuqi Yan
>            Assignee: Yuqi Yan
>            Priority: Normal
>             Fix For: 4.1.x
>
>         Attachments: image-2024-12-20-06-14-54-493.png, 
> image-2024-12-20-06-39-04-194.png
>
>
> We observed slow memtable flushes, which caused write latency spikes, when the 
> cluster had a large number of SSTables (~15K). Looking into the CPU profiling, 
> it seems that most of the CPU was busy doing updates on the View from compaction.
> !image-2024-12-20-06-14-54-493.png|width=570,height=476!
> Taking a closer look, these checkpoint calls are mostly from 
> maybeReopenEarly. I don't fully understand how this early-open mechanism 
> works, but according to some recent investigation, the checkpoint calls can 
> become expensive (observed in CASSANDRA-19596 and CASSANDRA-20158).
> The replaceFlushed update also requires an entire SSTableIntervalTree rebuild, 
> which can take significantly longer as the number of SSTables grows.
>  
> {code:java}
> static Function<View, View> replaceFlushed(final Memtable memtable, final Iterable<SSTableReader> flushed)
> {
>     return new Function<View, View>()
>     {
>         public View apply(View view)
>         {
>             List<Memtable> flushingMemtables = copyOf(filter(view.flushingMemtables, not(equalTo(memtable))));
>             assert flushingMemtables.size() == view.flushingMemtables.size() - 1;
>             if (flushed == null || Iterables.isEmpty(flushed))
>                 return new View(view.liveMemtables, flushingMemtables, view.sstablesMap,
>                                 view.compactingMap, view.intervalTree);
>             Map<SSTableReader, SSTableReader> sstableMap = replace(view.sstablesMap, emptySet(), flushed);
>             return new View(view.liveMemtables, flushingMemtables, sstableMap, view.compactingMap,
>                             SSTableIntervalTree.build(sstableMap.keySet()));
>         }
>     };
> }
> {code}
> When a node is busy with compaction, the {{replaceFlushed}} update can easily 
> encounter contention. Assuming 1 thread doing the memtable flush and 6 concurrent 
> compactors running at full speed, these 7 threads each take a similar time (say 
> T ms) to generate the new View. There is only a 1/7 chance that the flush 
> succeeds on each attempt, so the expected finish time is 7 * T ms...
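The contention pattern above can be sketched as a compare-and-swap retry loop. This is a simplified model only, not the actual Tracker/View code; the class and field names here are made up for illustration:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

public class ViewCasSketch
{
    // Stand-in for the immutable View; only a version counter here.
    static final class View
    {
        final int version;
        View(int version) { this.version = version; }
    }

    static final AtomicReference<View> current = new AtomicReference<>(new View(0));

    // Each updater recomputes the new View from scratch and retries on
    // conflict. With N concurrent updaters whose update function costs ~T ms
    // each, only one wins per round, so a given updater expects ~N rounds,
    // i.e. ~N * T ms to finish.
    static int apply(Function<View, View> update)
    {
        int attempts = 0;
        while (true)
        {
            attempts++;
            View old = current.get();
            View next = update.apply(old);   // expensive, e.g. full interval tree rebuild
            if (current.compareAndSet(old, next))
                return attempts;             // only one winner per round
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        Thread[] threads = new Thread[7];    // 1 flusher + 6 compactors
        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(() -> apply(v -> new View(v.version + 1)));
            threads[i].start();
        }
        for (Thread t : threads)
            t.join();
        System.out.println(current.get().version); // 7: every update lands eventually
    }
}
```

Every update eventually lands, but the wasted work grows with the number of concurrent updaters, which is why cheap per-update work matters so much here.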
>  
> What's worse, replaceFlushed is not only competing for View updates with 
> checkpoint calls from compaction. In some of my testing, I noticed that there 
> is a surge in unmarkCompacting calls every hour (from the default 
> index_summary_resize_interval = 60m), and this lasts for 2 minutes or longer. 
> During this 2-minute window, replaceFlushed finishTime is significantly 
> longer and we see pending MutationStage tasks starting to pile up (all of 
> them waiting for the memtable to be flushed), and hence writes start to 
> time out.
> !image-2024-12-20-06-39-04-194.png|width=1259,height=292!
> These unmarkCompacting calls were from 
> IndexSummaryRedistribution.redistributeSummaries(). If I understand it 
> correctly, what happens here is:
>  # IndexSummaryManager marks all SSTables as compacting, which is done in 
> one go per cfs by iterating all cfs and adding all their SSTables to compactingMap
>  # in adjustSamplingLevels(), it calculates the sampling level for each 
> sstable (probably also doing some updates on the SSTableReader?)
>  # unmark compacting for the SSTables which don't need downsampling
> Step 3 is done *one by one*, hence causing the trouble here:
> {code:java}
> if (remainingSpace > 0)
> {
>     Pair<List<SSTableReader>, List<ResampleEntry>> result = distributeRemainingSpace(toDownsample, remainingSpace);
>     toDownsample = result.right;
>     newSSTables.addAll(result.left);
>     for (SSTableReader sstable : result.left)
>         transactions.get(sstable.metadata().id).cancel(sstable);
> }
> {code}
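A batched version of that cancel loop could group the SSTables by table first and unmark each group in a single pass. This is a sketch only; the types and the cancelAll method are hypothetical stand-ins, not the real LifecycleTransaction API:

```java
import java.util.*;
import java.util.stream.Collectors;

public class BatchedCancelSketch
{
    // Minimal stand-ins for the real types; only what the sketch needs.
    record SSTable(String tableId, String name) {}

    interface Transaction
    {
        // Hypothetical batch variant of cancel(sstable): unmark compacting
        // for a whole group under a single View update.
        void cancelAll(Collection<SSTable> sstables);
    }

    // Instead of one View update per SSTable, group by table id and issue
    // one cancelAll per transaction. Returns the number of batch calls made.
    static int batchedCancel(Map<String, Transaction> transactions,
                             List<SSTable> keepAtCurrentLevel)
    {
        Map<String, List<SSTable>> byTable =
            keepAtCurrentLevel.stream().collect(Collectors.groupingBy(SSTable::tableId));
        byTable.forEach((tableId, group) -> transactions.get(tableId).cancelAll(group));
        return byTable.size();
    }
}
```

The point is just that the number of contended View updates drops from one per SSTable to one per table.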
>  
> Here I have 2 proposals for improving this:
>  # Fix redistributeSummaries() to group these SSTables and unmark compacting 
> in one go
>  # Make replaceFlushed faster by supporting addSSTables in the IntervalTree. 
> replaceFlushed calls don't remove anything from the existing IntervalTree, so 
> inserting the new intervals into the IntervalTree can be fast (at least 
> faster than rebuilding the entire tree)
>  ** The only concern here is that addSSTables might create an imbalanced tree, 
> but since we rebuild the tree very frequently anyway, I think this should not 
> be a huge concern
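The idea behind proposal 2 can be sketched with a plain sorted list standing in for the interval tree (illustrative only; the real SSTableIntervalTree is a different structure): a full rebuild costs O(n log n) per flush, while inserting one new interval costs O(log n) to locate plus one shift.

```java
import java.util.*;

public class IncrementalInsertSketch
{
    // Stand-in for an SSTable's token range, keyed by its start.
    record Interval(long start, long end) {}

    // Rebuild: sort everything from scratch, O(n log n) per flush.
    static List<Interval> rebuild(Collection<Interval> all)
    {
        List<Interval> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingLong(Interval::start));
        return sorted;
    }

    // Incremental: binary-search the insertion point, O(log n) + one shift,
    // leaving the existing entries untouched.
    static void insert(List<Interval> sorted, Interval added)
    {
        int idx = Collections.binarySearch(
            sorted, added, Comparator.comparingLong(Interval::start));
        sorted.add(idx >= 0 ? idx : -idx - 1, added);
    }
}
```

Since replaceFlushed only ever adds the freshly flushed SSTables, the incremental path never needs the removal case that makes balanced-tree maintenance hard.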
> Let me know what you think about this. I'm not sure why this cancel is 
> done one by one - is it possible to do it in a batch grouped by cfs (just like 
> how we mark them as compacting)?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
