Wei Deng created CASSANDRA-12615:
------------------------------------
Summary: Improve LCS compaction concurrency during L0->L1
compaction
Key: CASSANDRA-12615
URL: https://issues.apache.org/jira/browse/CASSANDRA-12615
Project: Cassandra
Issue Type: Improvement
Components: Compaction
Reporter: Wei Deng
Attachments: L0_L1_inefficiency.jpg
I've done multiple experiments with {{compaction-stress}} at 100GB, 200GB,
400GB and 600GB levels. These scenarios share a common pattern: at the
beginning of the compaction they all overwhelm L0 with a lot (hundreds to
thousands) of 128MB SSTables. One common observation I noticed from visualizing
the compaction.log files from these tests is that initially after some massive
STCS-in-L0 activities (could take up to 40% of total compaction time), L0->L1
always takes a really long time which frequently involves all of the bigger L0
SSTables (the results of many STCS compactions earlier) and all of the 10 L1
SSTables, and the output covers almost the full data set. Since L0->L1 can only
happen single-threaded, we often spend close to 40% of the total compaction
time in this L0->L1 stage, and only after this first really long L0->L1
finishes and 100s or 1000s of SSTables land on L1, can concurrent compactions
at higher levels resume (to move the thousands of L1 SSTables to higher
levels). The attached snapshot demonstrates this observation.
The question is, if this L0->L1 compaction is so big and can only happen
single-threaded, and ends up generating thousands of L1 SSTables, most of which
will have to up-level later anyway (as L1 can accommodate at most 10 SSTables),
why not start that L1+ up-level earlier, i.e. before this L0->L1 compaction
finishes.
I can think of a few approaches: 1) break L0->L1 into smaller chunks if it
realizes that the output of such L0->L1 compaction is going to far exceed the
capacity of L1, this will allow each L0->L1 to finish sooner, and have the
resulting L1 SSTables to be able to participate in higher up-level activities;
2) still treating the full L0->L1 as one big compaction session, but making the
intermediate results (once the number of L1 SSTable output exceeds the L1
capacity) available for higher up-level activities.
If we can somehow leverage more threads during this massive L0->L1 phase, we
can save close to 40% of the total compaction time when L0 is initially
backlogged, which will be a great improvement to our LCS compaction throughput.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)