Hi Bhuvan,

Thank you so very much for your detailed reply.
Just to ensure everyone is across the same information, and responses are
not duplicated across two different forums, I thought I'd share with the
mailing list that I've created a GitHub issue at:
https://github.com/thelastpickle/cassandra-reaper/issues/39

Kind regards,
Daniel

On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but repair kept
> getting stuck on and off, and we had to do a rolling restart of the
> cluster or wait for the lock time to expire (~1 hr).
>
> We had a look at the stuck repair: thread pools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code, it
> appeared that at most one concurrent repair session per node is possible.
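>
> Roughly, the shape of the problem is sketched below. This is an illustrative
> simplification rather than the actual Cassandra source; only the class name,
> the removeParentRepairSession method and the locking on the service instance
> are taken from the thread dump further down.
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import java.util.UUID;
>
>     // Illustrative sketch only: repair messages handled on AntiEntropyStage
>     // end up in methods that synchronize on the single ActiveRepairService
>     // instance, so a long-running holder of that lock (e.g. during
>     // anti-compaction) blocks every other repair session on the node.
>     public class ActiveRepairServiceSketch
>     {
>         private final Map<UUID, Object> parentRepairSessions = new HashMap<>();
>
>         public synchronized void registerParentRepairSession(UUID parentId, Object session)
>         {
>             parentRepairSessions.put(parentId, session);
>         }
>
>         // A second AntiEntropyStage thread calling this while another thread
>         // holds the monitor shows up as BLOCKED, as in the dump further down.
>         public synchronized void removeParentRepairSession(UUID parentId)
>         {
>             parentRepairSessions.remove(parentId);
>             // ... anti-compaction and cleanup for the session would run here
>         }
>     }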
>
> According to https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> Segment runner has protection mechanism to avoid overloading nodes using
> two simple rules to postpone repair if:
>
> 1. Number of pending compactions is greater than MAX_PENDING_COMPACTIONS
>    (20 by default)
> 2. Node is already running repair job
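>
> As a rough sketch of what those two rules amount to (illustrative only;
> MAX_PENDING_COMPACTIONS is the constant named in the article, while the
> class, interface and method names below are placeholders rather than
> Reaper's real API):
>
>     // Illustrative postpone check a segment runner could apply before
>     // handing a repair segment to a replica node; not Reaper's actual code.
>     public final class SegmentPostponeCheck
>     {
>         // Placeholder for whatever per-node metrics source Reaper really uses.
>         public interface NodeMetrics
>         {
>             int getPendingCompactions();
>             boolean hasActiveRepair();
>         }
>
>         static final int MAX_PENDING_COMPACTIONS = 20;
>
>         static boolean shouldPostpone(NodeMetrics replica)
>         {
>             // Rule 1: too many pending compactions on the replica
>             if (replica.getPendingCompactions() > MAX_PENDING_COMPACTIONS)
>                 return true;
>
>             // Rule 2: the replica is already running a repair job
>             return replica.hasActiveRepair();
>         }
>     }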
>
> We tried running Reaper with fewer threads than the number of nodes
> (assuming Reaper would not submit multiple segments to a single Cassandra
> node), but we still observed multiple repair segments going to the same
> node concurrently, so nodes could still get stuck in that state. We
> finally settled on a single repair thread in the Reaper settings. Although
> it takes slightly more time, it has completed successfully numerous times.
>
> Thread dump of the Cassandra server when repair was getting stuck:
>
> "*AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x00007f0fa16226a0
> nid=0x3c82 waiting for monitor entry [0x00007ee9eabaf000*]
>    java.lang.Thread.State: BLOCKED (*on object monitor*)
>         at org.apache.cassandra.service.ActiveRepairService.removeParen
> tRepairSession(ActiveRepairService.java:392)
>         - waiting to lock <0x000000067c083308> (a
> org.apache.cassandra.service.ActiveRepairService)
>         at org.apache.cassandra.service.ActiveRepairService.doAntiCompa
> ction(ActiveRepairService.java:417)
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(
> RepairMessageVerbHandler.java:145)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeli
> veryTask.java:67)
>         at java.util.concurrent.Executors$RunnableAdapter.call(
> Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
> Executor.java:1142)
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Daniel,
>>
>> could you file a bug in the issue tracker? https://github.com/thelastpickle/cassandra-reaper/issues
>>
>> We'll figure out what's wrong and get your repairs running.
>>
>> Thanks !
>>
>> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <dan...@kleviansky.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Using The Last Pickle's fork of Reaper, and unfortunately running into a
>>> bit of an issue. I'll try to break it down below.
>>>
>>> # Problem Description:
>>> * After starting repair via the GUI, progress remains at 0/x.
>>> * Cassandra nodes calculate their respective token ranges, and then
>>> nothing happens.
>>> * There were no errors in the Reaper or Cassandra logs. Only a message
>>> of acknowledgement that a repair had initiated.
>>> * Performing a stack trace on the running JVM, one can see that the
>>> thread spawning the repair process was waiting on a lock that was never
>>> released.
>>> * This occurred on all nodes, and prevented any manually initiated
>>> repair process from running. A rolling restart of each node was required,
>>> after which one could run a `nodetool repair` successfully.
>>>
>>> # Cassandra Cluster Details:
>>> * Cassandra 2.2.5 running on Windows Server 2008 R2
>>> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>>>
>>> # Reaper Details:
>>> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL
>>> database.
>>>
>>> ## Reaper settings:
>>> * Parallelism: DC-Aware
>>> * Repair Intensity: 0.9
>>> * Incremental: true
>>>
>>> Don't want to swamp you with more details or unnecessary logs,
>>> especially as I'd have to sanitize them before sending them out, so please
>>> let me know if there is anything else I can provide, and I'll do my best to
>>> get it to you.
>>>
>>> Kind regards,
>>> Daniel
>>>
>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>


-- 
Daniel Kleviansky
System Engineer & CX Consultant
M: +61 (0) 499 103 043 | E: dan...@kleviansky.com | W: http://danielkleviansky.com
