[ 
https://issues.apache.org/jira/browse/CASSANDRA-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg reassigned CASSANDRA-12526:
------------------------------------------

    Assignee: Ariel Weisberg

> For LCS, single SSTable up-level is handled inefficiently
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-12526
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12526
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Wei Deng
>            Assignee: Ariel Weisberg
>            Priority: Major
>              Labels: compaction, lcs, performance
>             Fix For: 4.x
>
>
> I used the latest trunk (as of August 2016, which will probably become 3.10) 
> to run some experiments on LeveledCompactionStrategy and noticed this 
> inefficiency.
> The test data was generated with the cassandra-stress default parameters 
> (keyspace1.standard1), so as you can imagine, it consists of a ton of newly 
> inserted partitions that will never merge in compactions, which is probably 
> the worst kind of workload for LCS (I'll explain later why this scenario 
> should not be dismissed as a corner case; for now, let's just assume we still 
> want to handle it efficiently).
> After the compaction test was done, I searched debug.log for lines matching 
> the "Compacted" summary so that I could see how long each individual 
> compaction took and how many bytes it processed. The search pattern is the 
> following:
> {noformat}
> grep 'Compacted.*standard1' debug.log
> {noformat}
> Interestingly, I noticed that a lot of the finished compactions are marked as 
> having involved *only one* SSTable. With the workload mentioned above, these 
> "single SSTable" compactions actually make up the majority of all compactions 
> (as shown below), so their efficiency can affect the overall compaction 
> throughput quite a bit.
> {noformat}
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' 
> debug.log-test1 | wc -l
> 243
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' 
> debug.log-test1 | grep ") 1 sstable" | wc -l
> 218
> {noformat}
> By looking at the code, it appears that there's a way to directly edit the 
> level of a particular SSTable like the following:
> {code}
> sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
> sstable.reloadSSTableMetadata();
> {code}
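> A minimal sketch of how that could be wrapped into a metadata-only up-level 
> helper (only the two calls quoted above are existing API; the wrapper, its 
> name and its placement are just an illustration):
> {code}
> /**
>  * Hypothetical helper: move an SSTable to targetLevel by rewriting only its
>  * Stats metadata component on disk and refreshing the in-memory copy,
>  * instead of rewriting the whole data file.
>  */
> private static void upLevelInPlace(SSTableReader sstable, int targetLevel) throws IOException
> {
>     // Persist the new level into the SSTable's metadata on disk.
>     sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
>     // Reload the metadata so the in-memory SSTableReader (and thus the
>     // leveled manifest) sees the new level.
>     sstable.reloadSSTableMetadata();
> }
> {code}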
> To be exact, I summed up the time spent on these single-SSTable compactions 
> (the total data size is 60GB) and found that if each of them only needed to 
> spend ~100ms on the metadata change alone (instead of the 10+ seconds they 
> take now), that by itself would already save 22.75% of the total compaction 
> time.
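> For reference, here is a rough standalone sketch of one way such a sum can be 
> computed from the log. It assumes the "Compacted" lines report their duration 
> as "in <N>ms"; the log file name and the ") 1 sstable" marker come from the 
> grep commands above:
> {code}
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> public class SumSingleSSTableCompactionTime
> {
>     public static void main(String[] args) throws Exception
>     {
>         // Duration reported by the "Compacted" log line, e.g. "... in 12,345ms."
>         Pattern duration = Pattern.compile("in ([\\d,]+)\\s*ms");
>         long totalMs = 0;
>         for (String line : Files.readAllLines(Paths.get("debug.log-test1")))
>         {
>             if (!(line.contains("Compacted") && line.contains("standard1") && line.contains(") 1 sstable")))
>                 continue; // only the single-SSTable compactions
>             Matcher m = duration.matcher(line);
>             if (m.find())
>                 totalMs += Long.parseLong(m.group(1).replace(",", ""));
>         }
>         System.out.printf("single-SSTable compaction time: %.1fs%n", totalMs / 1000.0);
>     }
> }
> {code}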
> Compared to what we have now (reading the whole single SSTable from the old 
> level and writing out the same single SSTable at the new level), the only 
> difference I can think of with this approach is that the new SSTable will 
> keep the same file name (sequence number) as the old one, which could break 
> assumptions in other parts of the code. However, given that we would no 
> longer have to go through the full read/write IO, bear the overhead of 
> deleting the old file and creating the new one, or generate extra churn in 
> the heap and the file buffers, it seems the benefits outweigh the 
> inconvenience. So I'd argue this JIRA is low-hanging fruit and the fix should 
> be made available in 3.0.x as well.
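> To make the idea concrete, the short-circuit could look roughly like the 
> following inside the leveled compaction path (the surrounding condition and 
> the variable names, e.g. mustMergeOrPurge and targetLevel, are hypothetical 
> placeholders, not actual Cassandra code):
> {code}
> // Hypothetical sketch: if a leveled compaction has exactly one input SSTable
> // and there is nothing to merge or purge, just rewrite its level metadata in
> // place instead of reading and rewriting all of its data.
> if (transaction.originals().size() == 1 && !mustMergeOrPurge)
> {
>     // Metadata-only up-level, as sketched in the helper above.
>     upLevelInPlace(transaction.originals().iterator().next(), targetLevel);
>     return; // skip the normal full rewrite of the SSTable at the new level
> }
> {code}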
> As mentioned in the 2nd paragraph, I also want to address why this kind of 
> all-new-partition workload should not be dismissed as a corner case. 
> Basically, for the main use case of LCS, where you need to frequently merge 
> partitions to optimize reads and to eliminate tombstones and expired data 
> sooner, LCS can be perfectly happy and perform the partition merging and 
> tombstone elimination efficiently for a long time. However, as soon as the 
> node becomes a bit unhealthy for whatever reason (it could be a bad disk, so 
> the node is missing a whole bunch of mutations and needs repair; it could be 
> that the user ingests far more data than the node usually takes and exceeds 
> its capacity; or, God forbid, some DBA runs the offline sstablelevelreset), 
> you will have to handle this kind of "all new partitions with a lot of 
> SSTables in L0" scenario, and once all the L0 SSTables finally get up-leveled 
> to L1, you will likely see a lot of these single-SSTable compactions, which 
> is the situation this JIRA is intended to address.
> Actually, the more I think about it, making this kind of single-SSTable 
> up-level more efficient will not only help the all-new-partition scenario, 
> but will also help in general whenever there is a big backlog of L0 SSTables 
> due to too many flushes or excessive repair streaming with vnodes. In those 
> situations, STCS-in-L0 will be triggered by default, and you end up with a 
> bunch of much bigger L0 SSTables once the STCS rounds are done. When it's 
> time to up-level those much bigger L0 SSTables, they will most likely overlap 
> among themselves, and you will add them all into your compaction session 
> (along with all of the overlapping L1 SSTables). Because these much bigger L0 
> SSTables have already gone through a few rounds of STCS compaction, whatever 
> partition merging was needed (because fragments of the same partition were 
> dispersed across the smaller, earlier L0 SSTables) has largely been done by 
> those STCS rounds, so the much bigger STCS-generated L0 SSTables leave little 
> further opportunity for partition merging, and we're back in a scenario very 
> similar to the L0 data that "consists of a ton of newly inserted partitions 
> that will never merge in compactions" mentioned earlier.



