Hi Bhuvan,

Thank you so very much for your detailed reply. Just to ensure everyone is
across the same information, and responses are not duplicated across two
different forums, I thought I'd share with the mailing list that I've
created a GitHub issue at:
https://github.com/thelastpickle/cassandra-reaper/issues/39
Kind regards,
Daniel

On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but repair kept
> getting stuck on and off, and we had to do a rolling restart of the
> cluster or wait for the lock time to expire (~1hr).
>
> We had a look at the stuck repair; thread pools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code, it
> appeared that at most one concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> The segment runner has a protection mechanism to avoid overloading nodes,
> using two simple rules to postpone repair if:
>
> 1. The number of pending compactions is greater than
> *MAX_PENDING_COMPACTIONS* (20 by default)
> 2. The node is already running a repair job
>
> We tried running Reaper with fewer repair threads than the number of
> nodes (assuming Reaper would not submit multiple segments to a single
> Cassandra node), but we still observed multiple repair segments going to
> the same node concurrently, and therefore nodes could still get stuck in
> that state. We finally settled on a single repair thread in the Reaper
> settings. Although it takes slightly more time, it has completed
> successfully numerous times.
>
> Thread dump of the Cassandra server when repair was getting stuck:
>
> "AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x00007f0fa16226a0
> nid=0x3c82 waiting for monitor entry [0x00007ee9eabaf000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
>         - waiting to lock <0x000000067c083308> (a org.apache.cassandra.service.ActiveRepairService)
>         at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
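>
> To make the bottleneck concrete, here is a minimal sketch of the locking
> pattern the dump points at (illustrative only, not the actual Cassandra
> source; the class and method names come from the trace, the body is an
> assumption):
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import java.util.UUID;
>
>     public class ActiveRepairService
>     {
>         // Stand-in for Cassandra's parent repair session bookkeeping.
>         private final Map<UUID, Object> parentRepairSessions = new HashMap<>();
>
>         // "synchronized" makes the service instance itself the monitor,
>         // so every AntiEntropyStage thread tearing down a parent repair
>         // session queues behind whichever thread currently holds the
>         // lock. If that holder stalls, the rest sit BLOCKED, exactly as
>         // in the dump above.
>         public synchronized Object removeParentRepairSession(UUID parentSessionId)
>         {
>             return parentRepairSessions.remove(parentSessionId);
>         }
>     }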
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Daniel,
>>
>> Could you file a bug in the issue tracker?
>> https://github.com/thelastpickle/cassandra-reaper/issues
>>
>> We'll figure out what's wrong and get your repairs running.
>>
>> Thanks!
>>
>> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <dan...@kleviansky.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Using The Last Pickle's fork of Reaper, and unfortunately running into
>>> a bit of an issue. I'll try to break it down below.
>>>
>>> # Problem Description:
>>> * After starting a repair via the GUI, progress remains at 0/x.
>>> * Cassandra nodes calculate their respective token ranges, and then
>>> nothing happens.
>>> * There were no errors in the Reaper or Cassandra logs, only a message
>>> acknowledging that a repair had been initiated.
>>> * Performing a stack trace on the running JVM, one can see that the
>>> thread spawning the repair process was waiting on a lock that was never
>>> released.
>>> * This occurred on all nodes, and prevented any manually initiated
>>> repair process from running. A rolling restart of each node was
>>> required, after which one could run `nodetool repair` successfully.
>>>
>>> # Cassandra Cluster Details:
>>> * Cassandra 2.2.5 running on Windows Server 2008 R2
>>> * 6-node cluster, split across 2 DCs, with RF = 3:3
>>>
>>> # Reaper Details:
>>> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a
>>> PostgreSQL database
>>>
>>> ## Reaper settings:
>>> * Parallelism: DC-Aware
>>> * Repair Intensity: 0.9
>>> * Incremental: true
>>>
>>> I don't want to swamp you with more details or unnecessary logs,
>>> especially as I'd have to sanitize them before sending them out, so
>>> please let me know if there is anything else I can provide, and I'll
>>> do my best to get it to you.
>>>
>>> Kind regards,
>>> Daniel
>>>
>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>

--
Daniel Kleviansky
System Engineer & CX Consultant
M: +61 (0) 499 103 043 | E: dan...@kleviansky.com | W: http://danielkleviansky.com
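P.S. For anyone else who hits this: before doing a rolling restart, it may
be worth confirming the nodes show the same symptom, e.g. with (assuming a
standard JDK install; <cassandra-pid> is a placeholder):

    jstack <cassandra-pid>
    nodetool tpstats

In the jstack output, look for BLOCKED AntiEntropyStage threads waiting on
the ActiveRepairService monitor; in tpstats, check the AntiEntropyStage
pool for blocked or pending tasks.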