Hi,

In reply to Dikang Gu:
For the run where we incorporated the change from CASSANDRA-11571 the stack trace was like this (from JMC):
*Stack Trace*                                                                                    *Sample Count*  *Percentage(%)*
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int)               229           11.983
-org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates()                         228           11.931
--org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(int)                            221           11.565
---org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(SSTableReader, Map)       201           10.518
----org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(Token, Token, Map)       201           10.518
-----org.apache.cassandra.dht.Bounds.intersects(Bounds)                                               141            7.378
-----java.util.HashSet.add(Object)                                                                     56            2.93


This is for one of the compaction executors during an interval of 1 minute and 24 seconds, but we saw similar behavior for other compaction threads as well. The full flight recording was 10 minutes and was started at the same time as the repair. The interval was taken from the end of the recording where the number of sstables had increased. During this interval this compaction thread used ~10% of the total CPU.
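
For reference, below is a simplified, self-contained sketch (hypothetical types and names, not the actual LeveledManifest code) of the kind of work the overlappingWithBounds / Bounds.intersects / HashSet.add frames suggest: each candidate's token bounds are compared against every tracked sstable's bounds, so the total number of intersects() calls grows with candidates x tracked sstables, which gets expensive with 15-20K sstables in L0.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Simplified sketch only; stands in for LeveledManifest.overlappingWithBounds.
    final class OverlapSketch
    {
        // A token range [left, right], standing in for org.apache.cassandra.dht.Bounds.
        record Bounds(long left, long right)
        {
            boolean intersects(Bounds other)
            {
                return left <= other.right && other.left <= right;
            }
        }

        // For one candidate, scan every tracked sstable's bounds and collect the
        // overlapping ones; with N candidates and M tracked sstables this is N*M
        // intersects() calls, matching the Bounds.intersects/HashSet.add frames.
        static <T> Set<T> overlapping(Bounds candidate, Map<T, Bounds> sstableBounds)
        {
            Set<T> results = new HashSet<>();
            for (Map.Entry<T, Bounds> entry : sstableBounds.entrySet())
                if (candidate.intersects(entry.getValue()))
                    results.add(entry.getKey());
            return results;
        }
    }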

I agree that ideally there shouldn't be many sstables in L0, and except for when repair is running we don't have that many.

---

In reply to Jeff Jirsa/Nate McCall:
I might have been unclear about the compaction order in my first email. What I meant is that there is a check for STCS right before L1+, but only if an L1+ compaction is possible. We used version 2.2.7 for the test run, so https://issues.apache.org/jira/browse/CASSANDRA-10979 should be included and should have reduced some of the L0 backlog.

Correct me if I'm wrong, but my interpretation of the scenario that Sylvain describes in https://issues.apache.org/jira/browse/CASSANDRA-5371 is that you either almost constantly have 32+ sstables in L0 or are close to it. My guess is that this also applies to having a constant load during a certain timespan. When you get more than 32 sstables in L0, STCS kicks in and creates larger sstables that might span the whole of L1. When those sstables are later promoted to L1, the whole of L1 is rewritten, which creates a larger backlog in L0. The number of sstables then keeps rising, triggers STCS again, and completes the circle. Based on this interpretation it seems to me that if the write pattern into L0 is "random", this can happen regardless of whether an STCS compaction has occurred or not.
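
As a rough back-of-envelope illustration (assuming the default LCS settings, i.e. sstable_size_in_mb = 160 and an L1 target of about ten sstables): an STCS-produced L0 sstable that spans most of the token range overlaps essentially all of L1, so promoting it means rewriting on the order of 10 x 160 MB ~ 1.6 GB of L1 data for that single sstable, and while that compaction runs new flushes keep accumulating in L0.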

If my interpretation is correct, it might be better to use a higher, configurable number of sstables in L0 before STCS kicks in. With reduced complexity the order could be something like this (roughly sketched in code after the list):
1. Perform STCS in L0 if we have more than X (1000?) sstables in L0.
2. Check L1+
3. Check for L0->L1
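
A minimal, self-contained sketch of that ordering (hypothetical names and simplified types; the real LeveledCompactionStrategy/LeveledManifest code is structured differently):

    import java.util.List;

    // Sketch of the proposed candidate-selection order, not actual Cassandra code.
    final class CompactionOrderSketch
    {
        record SSTable(long sizeBytes) {}

        enum Choice { STCS_IN_L0, L1_PLUS, L0_TO_L1, NOTHING }

        // levels.get(0) is L0, levels.get(1) is L1, and so on.
        static Choice next(List<List<SSTable>> levels,
                           int maxL0BeforeStcs,          // step 1 threshold, e.g. 1000 (configurable)
                           long[] levelSizeTargetBytes)  // allowed bytes per level for L1+
        {
            // 1. With a very large L0 backlog, fall back to STCS in L0 right away
            //    and skip the per-sstable overlap checks entirely.
            if (levels.get(0).size() > maxL0BeforeStcs)
                return Choice.STCS_IN_L0;

            // 2. Otherwise prefer any L1+ level that is over its size target.
            for (int level = levels.size() - 1; level >= 1; level--)
            {
                long size = levels.get(level).stream().mapToLong(SSTable::sizeBytes).sum();
                if (size > levelSizeTargetBytes[level])
                    return Choice.L1_PLUS;
            }

            // 3. Finally the normal L0 -> L1 promotion, where the overlap check
            //    only runs against a bounded number of L0 sstables.
            return levels.get(0).isEmpty() ? Choice.NOTHING : Choice.L0_TO_L1;
        }
    }

The point of step 1 is only that the expensive overlap scan never runs against an unbounded L0.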

It should be possible to keep the current logic as well and only add a configurable check before it (step 1) to avoid the overlapping check with larger backlogs. Another alternative might be https://issues.apache.org/jira/browse/CASSANDRA-7409, which would allow overlapping sstables in more levels than L0. If it can quickly push sorted data to L1, it might remove the need for STCS in LCS. The potential cost of the overlapping check mentioned earlier would still be there with a large backlog, but the approach might reduce the risk of getting into that situation in the first place. I'll try to get some time to run a test with CASSANDRA-7409 in our test cluster.

BR
Marcus O

On 11/28/2016 06:48 PM, Eric Evans wrote:
On Sat, Nov 26, 2016 at 6:30 PM, Dikang Gu<dikan...@gmail.com>  wrote:
Hi Marcus,

Do you have a stack trace showing which function in `getNextBackgroundTask` is the most expensive?

Yeah, I think having 15-20K sstables in L0 is very bad. In our heavy-write cluster I try my best to reduce the impact of repair and keep the number of sstables in L0 < 100.

Thanks
Dikang.

On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall<zznat...@gmail.com>  wrote:

The reason is described here:
https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
/Marcus
"...a lot of the work you've done you will redo when you compact your now
bigger L0 sstable against L1."

^ Sylvain's hypothesis (next comment down) is actually something we see
occasionally in practice: having to re-write the contents of L1 too often
when large L0 SSTables are pulled in. Here is an example we took on a
system with pending compaction spikes that was seeing this specific issue
with four LCS-based tables:

https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950

The significant part of this particular workload is a burst of heavy writes
from long-duration scheduled jobs.


--
Dikang

