[ https://issues.apache.org/jira/browse/CASSANDRA-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517703#comment-15517703 ]
Dikang Gu commented on CASSANDRA-12526:
---------------------------------------
I observed some single-SSTable compactions as well. I haven't calculated the
percentage yet, but I feel it may be worth having an option to skip the
compaction when only a single SSTable is involved. [~krummas] any thoughts?
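
If we did add such an option, a minimal sketch of how an opt-in flag could be
read from a strategy's option map is below. The option name
"single_sstable_uplevel" and the class name are made up here purely for
illustration; they are not an existing option or API.

{code}
import java.util.Map;

// Hypothetical sketch only: reading an opt-in flag from a compaction
// strategy's options map. The option name "single_sstable_uplevel" is an
// assumption for illustration, not something defined by this ticket.
final class SingleSSTableUplevelOption
{
    static boolean isEnabled(Map<String, String> options)
    {
        return Boolean.parseBoolean(options.getOrDefault("single_sstable_uplevel", "false"));
    }
}
{code}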
> For LCS, single SSTable up-level is handled inefficiently
> ---------------------------------------------------------
>
> Key: CASSANDRA-12526
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12526
> Project: Cassandra
> Issue Type: Improvement
> Components: Compaction
> Reporter: Wei Deng
> Labels: compaction, lcs, performance
>
> I'm using the latest trunk (as of August 2016, which probably is going to be
> 3.10) to run some experiments on LeveledCompactionStrategy and noticed this
> inefficiency.
> The test data is generated using cassandra-stress default parameters
> (keyspace1.standard1), so as you can imagine, it consists of a ton of newly
> inserted partitions that will never merge in compactions, which is probably
> the worst kind of workload for LCS (however, I'll detail later why this
> scenario should not be ignored as a corner case; for now, let's just assume
> we still want to handle this scenario efficiently).
> After the compaction test was done, I grepped debug.log for lines matching
> the "Compacted" summary so that I could see how long each individual
> compaction took and how many bytes it processed. The search pattern is the
> following:
> {noformat}
> grep 'Compacted.*standard1' debug.log
> {noformat}
> Interestingly, I noticed that a lot of the finished compactions are marked as
> having *only one* SSTable involved. With the workload mentioned above, these
> "single SSTable" compactions actually make up the majority of all compactions
> (as shown below), so their efficiency can affect the overall compaction
> throughput quite a bit.
> {noformat}
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1'
> debug.log-test1 | wc -l
> 243
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1'
> debug.log-test1 | grep ") 1 sstable" | wc -l
> 218
> {noformat}
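> (That's 218 single-SSTable compactions out of 243 total, i.e. roughly 90% of
> all compactions for this table.)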
> By looking at the code, it appears that there's a way to directly edit the
> level of a particular SSTable like the following:
> {code}
> sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
> sstable.reloadSSTableMetadata();
> {code}
> To be exact, I summed up the time spent on these single-SSTable compactions
> (the total data size is 60GB) and found that if each of them only needed to
> spend ~100ms on the metadata change alone (instead of the 10+ seconds they
> take now), we would already achieve a 22.75% saving on total compaction time.
> Compared to what we have now (reading the whole single SSTable from the old
> level and writing out the same SSTable at the new level), the only difference
> I can think of with this approach is that the new SSTable keeps the same file
> name (sequence number) as the old one, which could break assumptions in other
> parts of the code. However, given that we avoid the full read/write IO, the
> overhead of deleting the old file and creating a new one, and the extra churn
> in the heap and file buffers, the benefits seem to outweigh that
> inconvenience. So I'd argue this JIRA qualifies as LHF and should be made
> available in 3.0.x as well.
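> To make the proposal concrete, here's a rough sketch of what a metadata-only
> up-level helper might look like. Only the two calls quoted above come from
> the actual code base; the class and method names, and the TODO about
> notifying the strategy, are my own assumptions rather than a worked-out
> patch:
> {code}
> import java.io.IOException;
>
> import org.apache.cassandra.io.sstable.format.SSTableReader;
>
> public final class SingleSSTableUplevelSketch
> {
>     private SingleSSTableUplevelSketch() {}
>
>     /**
>      * Move a single SSTable to targetLevel by rewriting only its Stats
>      * metadata instead of re-reading and re-writing its data files.
>      */
>     public static void uplevel(SSTableReader sstable, int targetLevel) throws IOException
>     {
>         // Rewrite the level stored in the Statistics component on disk.
>         sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
>         // Reload the in-memory metadata so the reader reflects the new level.
>         sstable.reloadSSTableMetadata();
>         // TODO: the compaction strategy / leveled manifest would also need to
>         // be told about the level change so its per-level bookkeeping stays
>         // consistent with the on-disk metadata.
>     }
> }
> {code}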
> As mentioned in the 2nd paragraph, I'm also going to address why this kind of
> all-new-partition workload should not be ignored as a corner case. Basically,
> for the main LCS use case, where you need to frequently merge partitions to
> optimize reads and eliminate tombstones and expired data sooner, LCS can be
> perfectly happy and perform the partition merging and tombstone elimination
> efficiently for a long time. However, as soon as the node becomes a bit
> unhealthy for whatever reason (it could be a bad disk, so the node is missing
> a whole bunch of mutations and needs repair; it could be the user ingesting
> far more data than the node usually takes and exceeding its capacity; or, god
> forbid, a DBA running the offline sstablelevelreset), you will have to handle
> this kind of "all-new-partition with a lot of SSTables in L0" scenario, and
> once all the L0 SSTables finally get up-leveled to L1, you will likely see a
> lot of such single-SSTable compactions, which is the situation this JIRA is
> intended to address.
> Actually, the more I think about this, making this kind of single-SSTable
> up-level more efficient will not only help the all-new-partition scenario,
> but will also help in general whenever there is a big backlog of L0 SSTables
> due to too many flushes or excessive repair streaming with vnodes. In those
> situations, STCS-in-L0 will be triggered by default, and you will end up with
> a bunch of much bigger L0 SSTables once STCS is done. When it's time to
> up-level those much bigger L0 SSTables, they will most likely overlap among
> themselves, so you will add them all into your compaction session (along with
> all the overlapping L1 SSTables). Since these much bigger L0 SSTables have
> already gone through a few rounds of STCS compaction, whatever partition
> merging needed to happen (because fragments of the same partition were
> dispersed across the smaller L0 SSTables earlier) has largely been done by
> those STCS rounds. So the much bigger L0 SSTables generated by STCS will not
> offer much more opportunity for partition merging, and we end up in a
> scenario very similar to the L0 data that "consists of a ton of newly
> inserted partitions that will never merge in compactions" mentioned earlier.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)