[
https://issues.apache.org/jira/browse/CASSANDRA-12200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Jirsa reassigned CASSANDRA-12200:
--------------------------------------
Assignee: Jeff Jirsa
> Backlogged compactions can make repair on trivially small tables wait a long
> time to finish
> --------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-12200
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12200
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Wei Deng
> Assignee: Jeff Jirsa
>
> In C* 3.0 we started to use incremental repair by default. However, this
> seems to create a repair performance problem if you have a relatively
> write-heavy workload that keeps all available concurrent_compactors busy with
> active compactions.
> I was able to demonstrate this issue by the following scenario:
> 1. On a three-node C* 3.0.7 cluster, use "cassandra-stress write n=100000000"
> to generate 100GB of data with keyspace1.standard1 table using LCS (ctrl+c
> the stress client once the data size on each node reaches 35+GB).
> 2. At this point, there will be hundreds of L0 SSTables waiting for LCS to
> digest on each node, and with concurrent_compactors left at its default of 2,
> the two compaction threads are constantly busy processing the backlogged L0
> SSTables.
> 3. Now create a new keyspace called "trivial_ks" with RF=3, create a small
> two-column CQL table in it, and insert 6 records (a rough sketch of this step
> is shown after the repair output below).
> 4. Start a "nodetool repair trivial_ks" session on one of the nodes, and
> watch the following behavior:
> {noformat}
> automaton@wdengdse50google-98425b985-3:~$ nodetool repair trivial_ks
> [2016-07-13 01:57:28,364] Starting repair command #1, repairing keyspace
> trivial_ks with repair options (parallelism: parallel, primary range: false,
> incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [],
> hosts: [], # of ranges: 3)
> [2016-07-13 01:57:31,027] Repair session 27212dd0-489d-11e6-a6d6-cd06faa0aaa2
> for range [(3074457345618258602,-9223372036854775808],
> (-9223372036854775808,-3074457345618258603],
> (-3074457345618258603,3074457345618258602]] finished (progress: 66%)
> [2016-07-13 02:07:47,637] Repair completed successfully
> [2016-07-13 02:07:47,657] Repair command #1 finished in 10 minutes 19 seconds
> {noformat}
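> As referenced in step 3, the trivial_ks setup could be done roughly as follows.
> This is only a minimal sketch using the DataStax Java driver; the contact
> point, column layout, and inserted values are assumptions made for
> illustration, and only the trivial_ks.weitest table name comes from the logs
> further below:
> {noformat}
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.Session;
>
> public class TrivialKsSetup {
>     public static void main(String[] args) {
>         // Connect to any node of the test cluster (the address is an assumption).
>         try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>              Session session = cluster.connect()) {
>             // Step 3: keyspace with RF=3.
>             session.execute("CREATE KEYSPACE IF NOT EXISTS trivial_ks WITH replication = "
>                     + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
>             // A small two-column table; this schema is hypothetical.
>             session.execute("CREATE TABLE IF NOT EXISTS trivial_ks.weitest "
>                     + "(id int PRIMARY KEY, val text)");
>             // Insert the 6 records mentioned in step 3.
>             for (int i = 1; i <= 6; i++) {
>                 session.execute("INSERT INTO trivial_ks.weitest (id, val) "
>                         + "VALUES (" + i + ", 'v" + i + "')");
>             }
>         }
>     }
> }
> {noformat}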
> Basically, for such a small table it took 10+ minutes to finish the repair.
> Looking at debug.log for this particular repair session UUID, you will find
> that all nodes were able to pass through validation compaction within 15ms,
> but one of the nodes got stuck waiting for a compaction slot: it had to run an
> anti-compaction step before it could tell the initiating node that it was done
> with its part of the repair session, and it took 10+ minutes for a compaction
> slot to be freed up, as shown in the following debug.log entries:
> {noformat}
> DEBUG [AntiEntropyStage:1] 2016-07-13 01:57:30,956
> RepairMessageVerbHandler.java:149 - Got anticompaction request
> AnticompactionRequest{parentRepairSession=27103de0-489d-11e6-a6d6-cd06faa0aaa2}
> org.apache.cassandra.repair.messages.AnticompactionRequest@34449ff4
> <...>
> <snip>
> <...>
> DEBUG [CompactionExecutor:5] 2016-07-13 02:07:47,506 CompactionTask.java:217
> - Compacted (286609e0-489d-11e6-9e03-1fd69c5ec46c) 32 sstables to
> [/var/lib/cassandra/data/keyspace1/standard1-9c02e9c1487c11e6b9161dbd340a212f/mb-499-big,]
> to level=0. 2,892,058,050 bytes to 2,874,333,820 (~99% of original) in
> 616,880ms = 4.443617MB/s. 0 total partitions merged to 12,233,340.
> Partition merge counts were {1:12086760, 2:146580, }
> INFO [CompactionExecutor:5] 2016-07-13 02:07:47,512
> CompactionManager.java:511 - Starting anticompaction for trivial_ks.weitest
> on
> 1/[BigTableReader(path='/var/lib/cassandra/data/trivial_ks/weitest-538b07d1489b11e6a9ef61c6ff848952/mb-1-big-Data.db')]
> sstables
> INFO [CompactionExecutor:5] 2016-07-13 02:07:47,513
> CompactionManager.java:540 - SSTable
> BigTableReader(path='/var/lib/cassandra/data/trivial_ks/weitest-538b07d1489b11e6a9ef61c6ff848952/mb-1-big-Data.db')
> fully contained in range (-9223372036854775808,-9223372036854775808],
> mutating repairedAt instead of anticompacting
> INFO [CompactionExecutor:5] 2016-07-13 02:07:47,570
> CompactionManager.java:578 - Completed anticompaction successfully
> {noformat}
> Since validation compaction has its own threads outside of the regular
> compaction thread pool restricted by concurrent_compactors, the validation
> phase went through without any issue. If we could treat anti-compaction the
> same way (i.e. give it its own thread pool), we could avoid this kind of
> repair performance problem.
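> To illustrate the idea (this is not actual Cassandra code, and the class and
> method names below are made up), a dedicated anti-compaction pool could look
> roughly like this, mirroring how validation already gets its own threads:
> {noformat}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
>
> // Sketch only: a small thread pool reserved for anti-compaction, so that
> // anti-compaction tasks no longer queue behind backlogged regular compactions
> // on the concurrent_compactors-bounded compaction executor.
> public class AnticompactionPool {
>     // One thread is plenty here; anti-compaction of a tiny table is cheap.
>     private final ExecutorService anticompactionExecutor =
>             Executors.newFixedThreadPool(1, r -> {
>                 Thread t = new Thread(r, "AnticompactionExecutor");
>                 t.setDaemon(true);
>                 return t;
>             });
>
>     // Submit the anti-compaction work (represented as a plain Runnable) to the
>     // dedicated pool instead of the shared compaction pool.
>     public Future<?> submitAnticompaction(Runnable anticompactionTask) {
>         return anticompactionExecutor.submit(anticompactionTask);
>     }
> }
> {noformat}
> With something like this, the anti-compaction for trivial_ks.weitest would
> start as soon as validation finishes, instead of waiting ~10 minutes for
> CompactionExecutor:5 to free up as in the log above.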
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)