Hi,

In reply to Dikang Gu:
For the run where we incorporated the change from CASSANDRA-11571 the stack trace was like this (from JMC):
*Stack Trace*                                                                                    *Sample Count*  *Percentage(%)*
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int)               229           11.983
-org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates()                         228           11.931
--org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(int)                            221           11.565
---org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(SSTableReader, Map)       201           10.518
----org.apache.cassandra.db.compaction.LeveledManifest.overlappingWithBounds(Token, Token, Map)       201           10.518
-----org.apache.cassandra.dht.Bounds.intersects(Bounds)                                               141            7.378
-----java.util.HashSet.add(Object)                                                                     56            2.93


This is for one of the compaction executors during an interval of 1 minute and 24 seconds, but we saw similar behavior for other compaction threads as well. The full flight recording was 10 minutes and was started at the same time as the repair. The interval was taken from the end of the recording where the number of sstables had increased. During this interval this compaction thread used ~10% of the total CPU.
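
For reference, below is a simplified, self-contained sketch (hypothetical types and names, not the actual LeveledManifest code) of the kind of work the overlappingWithBounds / Bounds.intersects / HashSet.add frames suggest: each candidate's token bounds are compared against every tracked sstable's bounds, so the total number of intersects() calls grows with candidates x tracked sstables, which gets expensive with 15-20K sstables in L0.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Simplified sketch only; stands in for LeveledManifest.overlappingWithBounds.
    final class OverlapSketch
    {
        // A token range [left, right], standing in for org.apache.cassandra.dht.Bounds.
        record Bounds(long left, long right)
        {
            boolean intersects(Bounds other)
            {
                return left <= other.right && other.left <= right;
            }
        }

        // For one candidate, scan every tracked sstable's bounds and collect the
        // overlapping ones; with N candidates and M tracked sstables this is N*M
        // intersects() calls, matching the Bounds.intersects/HashSet.add frames.
        static <T> Set<T> overlapping(Bounds candidate, Map<T, Bounds> sstableBounds)
        {
            Set<T> results = new HashSet<>();
            for (Map.Entry<T, Bounds> entry : sstableBounds.entrySet())
                if (candidate.intersects(entry.getValue()))
                    results.add(entry.getKey());
            return results;
        }
    }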

I agree that ideally there shouldn't be many sstables in L0, and except for when repair is running we don't have that many.

---

In reply to Jeff Jirsa/Nate McCall:
I might have been unclear about the compaction order in my first email. What I meant is that there is a check for STCS right before L1+, but only if an L1+ compaction is possible. We used version 2.2.7 for the test run, so https://issues.apache.org/jira/browse/CASSANDRA-10979 should be included and should have reduced some of the L0 backlog.

Correct me if I'm wrong, but my interpretation of the scenario that Sylvain describes in https://issues.apache.org/jira/browse/CASSANDRA-5371 is that you either almost constantly have 32+ sstables in L0 or are close to it. My guess is that this also applies to having a constant load during a certain timespan. When you get more than 32 sstables in L0, STCS kicks in and creates larger sstables that might span the whole of L1. When those sstables are later promoted to L1, the whole of L1 is rewritten, which creates a larger backlog in L0. The number of sstables then keeps rising, triggers STCS again, and completes the circle. Based on this interpretation it seems to me that if the write pattern into L0 is "random", this can happen regardless of whether an STCS compaction has occurred or not.
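
As a rough back-of-envelope illustration (assuming the default LCS settings, i.e. sstable_size_in_mb = 160 and an L1 target of about ten sstables): an STCS-produced L0 sstable that spans most of the token range overlaps essentially all of L1, so promoting it means rewriting on the order of 10 x 160 MB ~ 1.6 GB of L1 data for that single sstable, and while that compaction runs new flushes keep accumulating in L0.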

If my interpretation is correct, it might be better to use a higher, configurable number of sstables in L0 before STCS kicks in. With reduced complexity the order could be something like this (roughly sketched in code after the list):
1. Perform STCS in L0 if we have more than X (1000?) sstables in L0.
2. Check L1+
3. Check for L0->L1
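
A minimal, self-contained sketch of that ordering (hypothetical names and simplified types; the real LeveledCompactionStrategy/LeveledManifest code is structured differently):

    import java.util.List;

    // Sketch of the proposed candidate-selection order, not actual Cassandra code.
    final class CompactionOrderSketch
    {
        record SSTable(long sizeBytes) {}

        enum Choice { STCS_IN_L0, L1_PLUS, L0_TO_L1, NOTHING }

        // levels.get(0) is L0, levels.get(1) is L1, and so on.
        static Choice next(List<List<SSTable>> levels,
                           int maxL0BeforeStcs,          // step 1 threshold, e.g. 1000 (configurable)
                           long[] levelSizeTargetBytes)  // allowed bytes per level for L1+
        {
            // 1. With a very large L0 backlog, fall back to STCS in L0 right away
            //    and skip the per-sstable overlap checks entirely.
            if (levels.get(0).size() > maxL0BeforeStcs)
                return Choice.STCS_IN_L0;

            // 2. Otherwise prefer any L1+ level that is over its size target.
            for (int level = levels.size() - 1; level >= 1; level--)
            {
                long size = levels.get(level).stream().mapToLong(SSTable::sizeBytes).sum();
                if (size > levelSizeTargetBytes[level])
                    return Choice.L1_PLUS;
            }

            // 3. Finally the normal L0 -> L1 promotion, where the overlap check
            //    only runs against a bounded number of L0 sstables.
            return levels.get(0).isEmpty() ? Choice.NOTHING : Choice.L0_TO_L1;
        }
    }

The point of step 1 is only that the expensive overlap scan never runs against an unbounded L0.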

It should be possible to keep the current logic as well and only add a configurable check before it (step 1) to avoid the overlapping check with larger backlogs. Another alternative might be https://issues.apache.org/jira/browse/CASSANDRA-7409, which would allow overlapping sstables in more levels than L0. If it can quickly push sorted data to L1, it might remove the need for STCS in LCS. The potential cost of the overlapping check mentioned earlier would still be there with a large backlog, but the approach might reduce the risk of getting into that situation in the first place. I'll try to get some time to run a test with CASSANDRA-7409 in our test cluster.

BR
Marcus O

On 11/28/2016 06:48 PM, Eric Evans wrote:
On Sat, Nov 26, 2016 at 6:30 PM, Dikang Gu<dikan...@gmail.com>  wrote:
Hi Marcus,

Do you have a stack trace showing which function in `getNextBackgroundTask` is the most expensive?

Yeah, I think having 15-20K sstables in L0 is very bad. In our heavy-write cluster I try my best to reduce the impact of repair and keep the number of sstables in L0 < 100.

Thanks
Dikang.

On Thu, Nov 24, 2016 at 12:53 PM, Nate McCall<zznat...@gmail.com>  wrote:

The reason is described here:
https://issues.apache.org/jira/browse/CASSANDRA-5371?focusedCommentId=13621679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13621679
/Marcus
"...a lot of the work you've done you will redo when you compact your now
bigger L0 sstable against L1."

^ Sylvain's hypothesis (next comment down) is actually something we see
occasionally in practice: having to re-write the contents of L1 too often
when large L0 SSTables are pulled in. Here is an example we took on a
system with pending compaction spikes that was seeing this specific issue
with four LCS-based tables:

https://gist.github.com/zznate/d22812551fa7a527d4c0d931f107c950

The significant part of this particular workload is a burst of heavy writes
from long-duration scheduled jobs.


--
Dikang

