Actually, the problem is related to CASSANDRA-11430 <https://issues.apache.org/jira/browse/CASSANDRA-11430>.

Before 2.2.6, the notification service did not work with the newly deprecated repair methods, on which Reaper still currently relies. C* 2.2.6 and onwards are not affected by this problem and work fine with Reaper. We're working on switching to the new repair method for 2.2 and 3.0/3.x, which should be ready in a few days/weeks.

When using incremental repair, watch out for CASSANDRA-11696, which was fixed in C* 2.1.15, 2.2.7, 3.0.8 and 3.8. In prior versions, unrepaired SSTables can be marked as repaired, and thus never be repaired.
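For the curious, the new repair method is the repairAsync operation on the StorageService MBean, exposed from C* 2.2 onwards. Below is a minimal sketch of invoking it over JMX; the host, keyspace name and option values are illustrative only, and the option keys should be double-checked against the RepairOption class of your Cassandra version:

import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RepairAsyncSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical node address; adjust for your cluster.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");

            // Repair options as string key/values (see o.a.c.repair.messages.RepairOption).
            Map<String, String> options = new HashMap<>();
            options.put("parallelism", "dc_parallel");
            options.put("primaryRange", "true");
            options.put("incremental", "false"); // full repair in this example

            // repairAsync(String keyspace, Map<String,String> options) returns a
            // command id; progress then comes back as JMX notifications.
            int cmd = (Integer) mbs.invoke(
                    storageService,
                    "repairAsync",
                    new Object[] { "my_keyspace", options },
                    new String[] { String.class.getName(), Map.class.getName() });
            System.out.println("Repair command id: " + cmd);
        }
    }
}

Those JMX notifications are presumably why the pre-2.2.6 breakage shows up as a repair that never progresses: the repair runs server-side, but the coordinating tool never hears about it.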
Cheers,

On Wed, Jan 4, 2017 at 6:09 AM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Hi Daniel,
>
> Looks like yours is a different case. If you're running incremental repair for the first time, it may take a long time, especially if the table is large, and the repair may seem stuck even when things are working.
>
> You can try nodetool compactionstats when the repair appears stuck; you'll find a validation compaction happening if that's indeed the case.
>
> For the first incremental repair you can follow this doc; in further repairs, incremental repair should encounter very few sstables:
>
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>
> Regards,
> Bhuvan
>
> On Jan 4, 2017 3:52 AM, "Daniel Kleviansky" <dan...@kleviansky.com> wrote:
>
> Hi Bhuvan,
>
> Thank you so very much for your detailed reply.
> Just to ensure everyone is across the same information, and responses are not duplicated across two different forums, I thought I'd share with the mailing list that I've created a GitHub issue at:
> https://github.com/thelastpickle/cassandra-reaper/issues/39
>
> Kind regards,
> Daniel
>
> On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with more repair threads than the number of Cassandra nodes, but on and off the repair was getting stuck, and we had to do a rolling restart of the cluster or wait for the lock time to expire (~1hr).
>
> We had a look at the stuck repair: threadpools were getting stuck at the AntiEntropy stage. From the synchronized block in the repair code, it appeared that at most one concurrent repair session per node is possible.
>
> According to https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> Segment runner has a protection mechanism to avoid overloading nodes, using two simple rules to postpone repair if:
>
> 1. Number of pending compactions is greater than MAX_PENDING_COMPACTIONS (20 by default)
> 2. Node is already running a repair job
>
> We tried running Reaper with fewer threads than the number of nodes (assuming Reaper would not submit multiple segments to a single Cassandra node), but it was still observed that multiple repair segments were going to the same node concurrently, and therefore nodes could still get stuck in that state. We finally settled on a single repair thread in the Reaper settings. Although it takes slightly more time, it has completed successfully numerous times.
>
> Thread dump of the Cassandra server when repair was getting stuck:
>
> "AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x00007f0fa16226a0 nid=0x3c82 waiting for monitor entry [0x00007ee9eabaf000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
>         - waiting to lock <0x000000067c083308> (a org.apache.cassandra.service.ActiveRepairService)
>         at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> Hope it helps!
>
> Regards,
> Bhuvan
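To make those two postponement rules concrete, here is a minimal, standalone sketch (not Reaper's actual code): it reads the same pending-compactions gauge that nodetool compactionstats reports and applies both checks. The repair-running flag is something a coordinator like Reaper tracks itself, so it is hard-coded here for illustration:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SegmentPostponeCheck {
    // Default threshold quoted in the article above.
    static final int MAX_PENDING_COMPACTIONS = 20;

    // Rule 1 + Rule 2: postpone if the node is overloaded or already repairing.
    static boolean shouldPostpone(int pendingCompactions, boolean repairRunning) {
        return pendingCompactions > MAX_PENDING_COMPACTIONS || repairRunning;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical node address; adjust as needed.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // The gauge behind the "pending tasks" line of nodetool compactionstats.
            ObjectName pendingTasks = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
            int pending = ((Number) mbs.getAttribute(pendingTasks, "Value")).intValue();

            // A real coordinator would know whether it already has a segment
            // running on this node; hard-coded here for illustration.
            boolean repairRunning = false;

            System.out.println(shouldPostpone(pending, repairRunning)
                    ? "Postpone segment" : "OK to repair");
        }
    }
}

As Bhuvan's experience shows, the check only helps if segments genuinely never land on a node concurrently, which is why a single repair thread turned out to be the reliable setting.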
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>
> Hi Daniel,
>
> could you file a bug in the issue tracker?
> https://github.com/thelastpickle/cassandra-reaper/issues
>
> We'll figure out what's wrong and get your repairs running.
>
> Thanks!
>
> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <dan...@kleviansky.com> wrote:
>
> Hi everyone,
>
> Using The Last Pickle's fork of Reaper, and unfortunately running into a bit of an issue. I'll try to break it down below.
>
> # Problem Description:
> * After starting repair via the GUI, progress remains at 0/x.
> * Cassandra nodes calculate their respective token ranges, and then nothing happens.
> * There were no errors in the Reaper or Cassandra logs, only a message of acknowledgement that a repair had initiated.
> * Performing a stack trace on the running JVM, one can see that the thread spawning the repair process was waiting on a lock that was never being released.
> * This occurred on all nodes, and prevented any manually initiated repair process from running. A rolling restart of each node was required, after which one could run a `nodetool repair` successfully.
>
> # Cassandra Cluster Details:
> * Cassandra 2.2.5 running on Windows Server 2008 R2
> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>
> # Reaper Details:
> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL database.
>
> ## Reaper settings:
> * Parallelism: DC-Aware
> * Repair Intensity: 0.9
> * Incremental: true
>
> Don't want to swamp you with more details or unnecessary logs, especially as I'd have to sanitize them before sending them out, so please let me know if there is anything else I can provide, and I'll do my best to get it to you.
>
> Kind regards,
> Daniel
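Daniel's "waiting on a lock that was never released" matches Bhuvan's thread dump above: removeParentRepairSession synchronizes on the ActiveRepairService instance, so while one thread holds that monitor, every other AntiEntropyStage thread touching the session bookkeeping blocks. A stripped-down model of the pattern (not Cassandra's actual code; holdMonitorForever is purely hypothetical):

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model of the contention in the thread dump: a single service-wide
// monitor serializes all parent-repair-session bookkeeping.
public class RepairServiceSketch {
    private final Map<UUID, Object> parentSessions = new HashMap<>();

    public synchronized void registerParentRepairSession(UUID id) {
        parentSessions.put(id, new Object());
    }

    // Synchronized on the service instance, matching the dump's frame
    // "waiting to lock <...> (a org.apache.cassandra.service.ActiveRepairService)".
    public synchronized void removeParentRepairSession(UUID id) {
        parentSessions.remove(id);
    }

    // Hypothetical stand-in: if any synchronized section holds the monitor
    // indefinitely, every thread entering the methods above parks in
    // BLOCKED, exactly as AntiEntropyStage:1 does in the dump.
    public synchronized void holdMonitorForever() throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }
}

This is also consistent with Bhuvan's conclusion: with a single Reaper repair thread, nothing ever queues up behind that monitor.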
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> --
> Daniel Kleviansky
> System Engineer & CX Consultant
> M: +61 (0) 499 103 043 | E: dan...@kleviansky.com | W: http://danielkleviansky.com

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com