Hi Daniel,

Looks like yours is a different case. If you're running incremental repair
for the first time, it may take a long time, especially if the table is
large, and the repair may appear to be stuck even when things are working.

You can run nodetool compactionstats while the repair appears stuck; if the
repair is actually progressing, you'll see a validation compaction running.
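
For example (a sketch; the output format varies a bit across versions, and
the keyspace/table names here are placeholders):

    $ nodetool compactionstats
    pending tasks: 1
       compaction type   keyspace      table      completed        total   unit   progress
            Validation   my_keyspace   my_table   1431634342   6259234345  bytes     22.87%

A row with compaction type Validation means the repair is still doing work.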

For the first incremental repair you can follow this doc; subsequent
incremental repairs should encounter very few unrepaired sstables:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
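
Roughly, the migration steps from that doc look like the following (a
sketch only; the keyspace, table, and sstable paths are placeholders, so
please check the doc for your exact version before running anything):

    # 1. Disable autocompaction on the tables being migrated
    nodetool disableautocompaction my_keyspace my_table

    # 2. Run a full (non-incremental) repair
    nodetool repair my_keyspace my_table

    # 3. With the node stopped, mark its sstables as repaired
    sstablerepairedset --really-set --is-repaired /path/to/my_table-Data.db

    # 4. Restart the node and re-enable autocompaction
    nodetool enableautocompaction my_keyspace my_table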

Regards,
Bhuvan



On Jan 4, 2017 3:52 AM, "Daniel Kleviansky" <dan...@kleviansky.com> wrote:

Hi Bhuvan,

Thank you so very much for your detailed reply.
Just to ensure everyone is across the same information, and responses are
not duplicated across two different forums, I thought I'd share with the
mailing list that I've created a GitHub issue at:
https://github.com/thelastpickle/cassandra-reaper/issues/39

Kind regards,
Daniel

On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but repair kept
> getting stuck intermittently, and we had to do a rolling restart of the
> cluster or wait for the lock time to expire (~1 hr).
>
> We had a look at the stuck repair: threadpools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code, it
> appeared that at most one concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> The segment runner has a protection mechanism to avoid overloading nodes,
> using two simple rules to postpone repair if:
>
> 1. The number of pending compactions is greater than
> MAX_PENDING_COMPACTIONS (20 by default)
> 2. The node is already running a repair job
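>
> A quick way to eyeball the first rule by hand (the 20 threshold being
> Reaper's default MAX_PENDING_COMPACTIONS):
>
>     nodetool compactionstats | grep "pending tasks"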
>
> We tried running Reaper with fewer threads than the number of nodes
> (assuming Reaper would not submit multiple segments to a single Cassandra
> node), but we still observed multiple repair segments going to the same
> node concurrently, so nodes could still get stuck in that state. We
> finally settled on a single repair thread in the Reaper settings. It
> takes slightly more time, but it has completed successfully numerous
> times.
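>
> For reference, that corresponds to the thread-count setting in
> cassandra-reaper.yaml (property name as we understand it; please verify
> against your Reaper version):
>
>     repairRunThreadCount: 1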
>
> Thread dump of the Cassandra server when repair was getting stuck:
>
> "AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x00007f0fa16226a0 nid=0x3c82 waiting for monitor entry [0x00007ee9eabaf000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
>         - waiting to lock <0x000000067c083308> (a org.apache.cassandra.service.ActiveRepairService)
>         at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
>         at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
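>
> A dump like this can be captured with the JDK's jstack against the
> Cassandra process (<cassandra_pid> is a placeholder for the actual PID):
>
>     jstack -l <cassandra_pid> > /tmp/cassandra-threads.txt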
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Daniel,
>>
>> could you file a bug in the issue tracker?
>> https://github.com/thelastpickle/cassandra-reaper/issues
>>
>> We'll figure out what's wrong and get your repairs running.
>>
>> Thanks !
>>
>> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <dan...@kleviansky.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm using The Last Pickle's fork of Reaper, and unfortunately I'm
>>> running into a bit of an issue. I'll try to break it down below.
>>>
>>> # Problem Description:
>>> * After starting repair via the GUI, progress remains at 0/x.
>>> * Cassandra nodes calculate their respective token ranges, and then
>>> nothing happens.
>>> * There were no errors in the Reaper or Cassandra logs. Only a message
>>> of acknowledgement that a repair had initiated.
>>> * Performing a stack trace on the running JVM, one can see that the
>>> thread spawning the repair process was waiting on a lock that was never
>>> released.
>>> * This occurred on all nodes, and prevented any manually initiated
>>> repair process from running. A rolling restart of each node was required,
>>> after which one could run a `nodetool repair` successfully.
>>>
>>> # Cassandra Cluster Details:
>>> * Cassandra 2.2.5 running on Windows Server 2008 R2
>>> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>>>
>>> # Reaper Details:
>>> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL
>>> database.
>>>
>>> ## Reaper settings:
>>> * Parallelism: DC-Aware
>>> * Repair Intensity: 0.9
>>> * Incremental: true
>>>
>>> Don't want to swamp you with more details or unnecessary logs,
>>> especially as I'd have to sanitize them before sending them out, so please
>>> let me know if there is anything else I can provide, and I'll do my best to
>>> get it to you.
>>>
>>> Kind regards,
>>> Daniel
>>>
>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>


-- 
Daniel Kleviansky
System Engineer & CX Consultant
M: +61 (0) 499 103 043 | E: dan...@kleviansky.com | W:
http://danielkleviansky.com
