Re: Reaper repair seems to "hang"

2017-01-04 Thread Alexander Dejanovski
Actually, the problem is related to CASSANDRA-11430.

Before 2.2.6, the notification service did not work with the newly deprecated
repair methods, which Reaper still relies on. C* 2.2.6 and onwards are not
affected by this problem and work fine with Reaper.

We're working on switching to the new repair method for 2.2 and 3.0/3.x,
which should be ready in a few days/weeks.

When using incremental repair, watch out for CASSANDRA-11696, which was
fixed in C* 2.1.15, 2.2.7, 3.0.8 and 3.8. In prior versions, unrepaired
SSTables can be marked as repaired, and thus never actually get repaired.
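
If you suspect you've hit it, one way to spot-check is to inspect the
repaired status of a few SSTables with sstablemetadata (an illustrative
invocation; the data path is just a placeholder for one of your SSTables):

    sstablemetadata /var/lib/cassandra/data/my_ks/my_table-*/*-Data.db | grep "Repaired at"

"Repaired at: 0" means an SSTable is still marked unrepaired; a non-zero
timestamp means it has been marked repaired.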

Cheers,



On Wed, Jan 4, 2017 at 6:09 AM Bhuvan Rawal  wrote:

> Hi Daniel,
>
> Looks like yours is a different case. If you're running incremental repair
> for the first time, it may take a long time, especially if the table is
> large, and the repair may seem to be stuck even when things are working.
>
> You can try nodetool compactionstats when the repair appears stuck; you'll
> see a validation compaction running if that's indeed the case.
>
> For the first incremental repair you can follow this doc; in subsequent
> runs, incremental repair should encounter very few SSTables:
>
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>
> Regards,
> Bhuvan
>
>
>
> On Jan 4, 2017 3:52 AM, "Daniel Kleviansky"  wrote:
>
> Hi Bhuvan,
>
> Thank you so very much for your detailed reply.
> Just to ensure everyone is across the same information, and responses are
> not duplicated across two different forums, I thought I'd share with the
> mailing list that I've created a GitHub issue at:
> https://github.com/thelastpickle/cassandra-reaper/issues/39
>
> Kind regards,
> Daniel
>
> On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal  wrote:
>
> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but repairs kept
> getting stuck intermittently and we had to do a rolling restart of the
> cluster or wait for the lock time to expire (~1 hr).
>
> We had a look at the stuck repair: thread pools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code it
> appeared that at most one concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> The segment runner has a protection mechanism to avoid overloading nodes,
> using two simple rules to postpone repair if:
>
> 1. The number of pending compactions is greater than MAX_PENDING_COMPACTIONS
> (20 by default)
> 2. The node is already running a repair job
>
> We tried running Reaper with fewer threads than the number of nodes
> (assuming Reaper would not submit multiple segments to a single Cassandra
> node), but we still observed multiple repair segments going to the same
> node concurrently, so nodes could still get stuck in that state. We
> finally settled on a single repair thread in the Reaper settings. It takes
> slightly more time, but has completed successfully numerous times.
>
> Thread dump of the Cassandra server when the repair was getting stuck:
>
> "*AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x7f0fa16226a0
> nid=0x3c82 waiting for monitor entry [0x7ee9eabaf000*]
>java.lang.Thread.State: BLOCKED (*on object monitor*)
> at
> org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
> - waiting to lock <0x00067c083308> (a
> org.apache.cassandra.service.ActiveRepairService)
> at
> org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
> at org.apache.cassandra.repair
> .RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
> at org.apache.cassandra.net
> .MessageDeliveryTask.run(MessageDeliveryTask.java:67)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
> Hi Daniel,
>
> could you file a bug in the issue tracker?
> https://github.com/thelastpickle/cassandra-reaper/issues
>
> We'll figure out what's wrong and get your repairs running.
>
> Thanks!
>
> On Tue, Jan 3, 2017 at 12:35 

Re: Reaper repair seems to "hang"

2017-01-03 Thread Bhuvan Rawal
Hi Daniel,

Looks like yours is a different case. If you're running incremental repair
for the first time, it may take a long time, especially if the table is
large, and the repair may seem to be stuck even when things are working.

You can try nodetool compactionstats when the repair appears stuck; you'll
see a validation compaction running if that's indeed the case.
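
A quick way to check, for example (run on the node that looks stuck):

    nodetool compactionstats

If the repair is actually making progress, the active compactions listed
should include one of type "Validation"; if nothing shows up there for a
long time, the session is more likely genuinely stuck.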

For the first incremental repair you can follow this doc; in subsequent
runs, incremental repair should encounter very few SSTables:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
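
For reference, the procedure in that doc boils down to roughly the following
(a paraphrased sketch from memory, with placeholder keyspace/table names and
paths; follow the linked doc for the authoritative steps):

    # 1. Stop autocompaction from mixing repaired and unrepaired SSTables
    nodetool disableautocompaction my_ks my_table
    # 2. Run one last full, non-incremental repair
    nodetool repair my_ks my_table
    # 3. Stop the node, then mark its existing SSTables as repaired
    sstablerepairedset --really-set --is-repaired /var/lib/cassandra/data/my_ks/my_table-*/*-Data.db
    # 4. Restart the node and re-enable autocompaction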

Regards,
Bhuvan



On Jan 4, 2017 3:52 AM, "Daniel Kleviansky"  wrote:

Hi Bhuvan,

Thank you so very much for your detailed reply.
Just to ensure everyone is across the same information, and responses are
not duplicated across two different forums, I thought I'd share with the
mailing list that I've created a GitHub issue at:
https://github.com/thelastpickle/cassandra-reaper/issues/39

Kind regards,
Daniel

On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal  wrote:

> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but repairs kept
> getting stuck intermittently and we had to do a rolling restart of the
> cluster or wait for the lock time to expire (~1 hr).
>
> We had a look at the stuck repair: thread pools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code it
> appeared that at most one concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> The segment runner has a protection mechanism to avoid overloading nodes,
> using two simple rules to postpone repair if:
>
> 1. The number of pending compactions is greater than MAX_PENDING_COMPACTIONS
> (20 by default)
> 2. The node is already running a repair job
>
> We tried running Reaper with fewer threads than the number of nodes
> (assuming Reaper would not submit multiple segments to a single Cassandra
> node), but we still observed multiple repair segments going to the same
> node concurrently, so nodes could still get stuck in that state. We
> finally settled on a single repair thread in the Reaper settings. It takes
> slightly more time, but has completed successfully numerous times.
>
> Thread dump of the Cassandra server when the repair was getting stuck:
>
> "*AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x7f0fa16226a0
> nid=0x3c82 waiting for monitor entry [0x7ee9eabaf000*]
>java.lang.Thread.State: BLOCKED (*on object monitor*)
> at org.apache.cassandra.service.ActiveRepairService.removeParen
> tRepairSession(ActiveRepairService.java:392)
> - waiting to lock <0x00067c083308> (a
> org.apache.cassandra.service.ActiveRepairService)
> at org.apache.cassandra.service.ActiveRepairService.doAntiCompa
> ction(ActiveRepairService.java:417)
> at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(
> RepairMessageVerbHandler.java:145)
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeli
> veryTask.java:67)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executor
> s.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
> Executor.java:1142)
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Daniel,
>>
>> could you file a bug in the issue tracker?
>> https://github.com/thelastpickle/cassandra-reaper/issues
>>
>> We'll figure out what's wrong and get your repairs running.
>>
>> Thanks!
>>
>> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Using The Last Pickle's fork of Reaper, I'm unfortunately running into a
>>> bit of an issue. I'll try to break it down below.
>>>
>>> # Problem Description:
>>> * After starting repair via the GUI, progress remains at 0/x.
>>> * Cassandra nodes calculate their respective token ranges, and then
>>> nothing happens.
>>> * There were no errors in the Reaper or Cassandra logs, only a message
>>> acknowledging that a repair had been initiated.
>>> * Performing a stack trace on the running JVM, one can see that the
>>> thread spawning the repair process was waiting on a lock that was never
>>> released.
>>> * This occurred on all nodes, and prevented any manually initiated
>>> repair process from running. A rolling restart of each node was required,
>>> after which one could run a `nodetool repair` successfully.

Re: Reaper repair seems to "hang"

2017-01-03 Thread Daniel Kleviansky
Hi Bhuvan,

Thank you so very much for your detailed reply.
Just to ensure everyone is across the same information, and responses are
not duplicated across two different forums, I thought I'd share with the
mailing list that I've created a GitHub issue at:
https://github.com/thelastpickle/cassandra-reaper/issues/39

Kind regards,
Daniel

On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal  wrote:

> Hi Daniel,
>
> We faced a similar issue during repair with Reaper. We ran repair with
> more repair threads than the number of Cassandra nodes, but repairs kept
> getting stuck intermittently and we had to do a rolling restart of the
> cluster or wait for the lock time to expire (~1 hr).
>
> We had a look at the stuck repair: thread pools were getting stuck at the
> AntiEntropy stage. From the synchronized block in the repair code it
> appeared that at most one concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :
>
> The segment runner has a protection mechanism to avoid overloading nodes,
> using two simple rules to postpone repair if:
>
> 1. The number of pending compactions is greater than MAX_PENDING_COMPACTIONS
> (20 by default)
> 2. The node is already running a repair job
>
> We tried running Reaper with fewer threads than the number of nodes
> (assuming Reaper would not submit multiple segments to a single Cassandra
> node), but we still observed multiple repair segments going to the same
> node concurrently, so nodes could still get stuck in that state. We
> finally settled on a single repair thread in the Reaper settings. It takes
> slightly more time, but has completed successfully numerous times.
>
> Thread dump of the Cassandra server when the repair was getting stuck:
>
> "*AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x7f0fa16226a0
> nid=0x3c82 waiting for monitor entry [0x7ee9eabaf000*]
>java.lang.Thread.State: BLOCKED (*on object monitor*)
> at org.apache.cassandra.service.ActiveRepairService.removeParen
> tRepairSession(ActiveRepairService.java:392)
> - waiting to lock <0x00067c083308> (a
> org.apache.cassandra.service.ActiveRepairService)
> at org.apache.cassandra.service.ActiveRepairService.doAntiCompa
> ction(ActiveRepairService.java:417)
> at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(
> RepairMessageVerbHandler.java:145)
> at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeli
> veryTask.java:67)
> at java.util.concurrent.Executors$RunnableAdapter.call(
> Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
> Executor.java:1142)
>
> Hope it helps!
>
> Regards,
> Bhuvan
>
>
> On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Daniel,
>>
>> could you file a bug in the issue tracker?
>> https://github.com/thelastpickle/cassandra-reaper/issues
>>
>> We'll figure out what's wrong and get your repairs running.
>>
>> Thanks!
>>
>> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Using The Last Pickle's fork of Reaper, I'm unfortunately running into a
>>> bit of an issue. I'll try to break it down below.
>>>
>>> # Problem Description:
>>> * After starting repair via the GUI, progress remains at 0/x.
>>> * Cassandra nodes calculate their respective token ranges, and then
>>> nothing happens.
>>> * There were no errors in the Reaper or Cassandra logs, only a message
>>> acknowledging that a repair had been initiated.
>>> * Performing a stack trace on the running JVM, one can see that the
>>> thread spawning the repair process was waiting on a lock that was never
>>> released.
>>> * This occurred on all nodes, and prevented any manually initiated
>>> repair process from running. A rolling restart of each node was required,
>>> after which one could run a `nodetool repair` successfully.
>>>
>>> # Cassandra Cluster Details:
>>> * Cassandra 2.2.5 running on Windows Server 2008 R2
>>> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>>>
>>> # Reaper Details:
>>> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL
>>> database.
>>>
>>> ## Reaper settings:
>>> * Parallelism: DC-Aware
>>> * Repair Intensity: 0.9
>>> * Incremental: true
>>>
>>> Don't want to swamp you with more details or unnecessary logs,
>>> especially as I'd have to sanitize them before sending them out, so please
>>> let me know if there is anything else I can provide, and I'll do my best
>>> to get it to you.

Re: Reaper repair seems to "hang"

2017-01-03 Thread Bhuvan Rawal
Hi Daniel,

We faced a similar issue during repair with Reaper. We ran repair with more
repair threads than the number of Cassandra nodes, but repairs kept getting
stuck intermittently and we had to do a rolling restart of the cluster or
wait for the lock time to expire (~1 hr).

We had a look at the stuck repair: thread pools were getting stuck at the
AntiEntropy stage. From the synchronized block in the repair code it
appeared that at most one concurrent repair session per node is possible.

According to
https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk :

The segment runner has a protection mechanism to avoid overloading nodes,
using two simple rules to postpone repair if:

1. The number of pending compactions is greater than MAX_PENDING_COMPACTIONS
(20 by default)
2. The node is already running a repair job
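
Conceptually, the postpone check amounts to something like the following
(an illustrative Java sketch, not Reaper's actual code; NodeStatus and the
method names are hypothetical):

    // Hypothetical sketch of the segment runner's two postpone rules
    static final int MAX_PENDING_COMPACTIONS = 20; // default per the article

    boolean shouldPostponeSegment(NodeStatus node) {
        return node.pendingCompactions() > MAX_PENDING_COMPACTIONS // rule 1
            || node.isRepairRunning();                             // rule 2
    }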

We tried running Reaper with fewer threads than the number of nodes
(assuming Reaper would not submit multiple segments to a single Cassandra
node), but we still observed multiple repair segments going to the same
node concurrently, so nodes could still get stuck in that state. We finally
settled on a single repair thread in the Reaper settings. It takes slightly
more time, but has completed successfully numerous times.

Thread dump of the Cassandra server when the repair was getting stuck:

"*AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x7f0fa16226a0
nid=0x3c82 waiting for monitor entry [0x7ee9eabaf000*]
   java.lang.Thread.State: BLOCKED (*on object monitor*)
at org.apache.cassandra.service.ActiveRepairService.
removeParentRepairSession(ActiveRepairService.java:392)
- waiting to lock <0x00067c083308> (a
org.apache.cassandra.service.ActiveRepairService)
at org.apache.cassandra.service.ActiveRepairService.
doAntiCompaction(ActiveRepairService.java:417)
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(
RepairMessageVerbHandler.java:145)
at org.apache.cassandra.net.MessageDeliveryTask.run(
MessageDeliveryTask.java:67)
at java.util.concurrent.Executors$RunnableAdapter.
call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
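
For reference, a dump like the one above can be captured with the JDK's
jstack tool (the pid placeholder is the Cassandra process id):

    jstack <cassandra_pid> > /tmp/cassandra-threads.txt

Grepping the output for AntiEntropyStage quickly shows whether those
threads are BLOCKED on the ActiveRepairService monitor.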

Hope it helps!

Regards,
Bhuvan


On Tue, Jan 3, 2017 at 11:16 AM, Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> Hi Daniel,
>
> could you file a bug in the issue tracker?
> https://github.com/thelastpickle/cassandra-reaper/issues
>
> We'll figure out what's wrong and get your repairs running.
>
> Thanks!
>
> On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky 
> wrote:
>
>> Hi everyone,
>>
>> Using The Last Pickle's fork of Reaper, I'm unfortunately running into a
>> bit of an issue. I'll try to break it down below.
>>
>> # Problem Description:
>> * After starting repair via the GUI, progress remains at 0/x.
>> * Cassandra nodes calculate their respective token ranges, and then
>> nothing happens.
>> * There were no errors in the Reaper or Cassandra logs, only a message
>> acknowledging that a repair had been initiated.
>> * Performing a stack trace on the running JVM, one can see that the thread
>> spawning the repair process was waiting on a lock that was never
>> released.
>> * This occurred on all nodes, and prevented any manually initiated repair
>> process from running. A rolling restart of each node was required, after
>> which one could run a `nodetool repair` successfully.
>>
>> # Cassandra Cluster Details:
>> * Cassandra 2.2.5 running on Windows Server 2008 R2
>> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>>
>> # Reaper Details:
>> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL
>> database.
>>
>> ## Reaper settings:
>> * Parallelism: DC-Aware
>> * Repair Intensity: 0.9
>> * Incremental: true
>>
>> Don't want to swamp you with more details or unnecessary logs, especially
>> as I'd have to sanitize them before sending them out, so please let me know
>> if there is anything else I can provide, and I'll do my best to get it to
>> you.
>>
>> Kind regards,
>> Daniel
>>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


Re: Reaper repair seems to "hang"

2017-01-02 Thread Alexander Dejanovski
Hi Daniel,

could you file a bug in the issue tracker?
https://github.com/thelastpickle/cassandra-reaper/issues

We'll figure out what's wrong and get your repairs running.

Thanks!

On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky 
wrote:

> Hi everyone,
>
> Using The Last Pickle's fork of Reaper, I'm unfortunately running into a
> bit of an issue. I'll try to break it down below.
>
> # Problem Description:
> * After starting repair via the GUI, progress remains at 0/x.
> * Cassandra nodes calculate their respective token ranges, and then
> nothing happens.
> * There were no errors in the Reaper or Cassandra logs, only a message
> acknowledging that a repair had been initiated.
> * Performing a stack trace on the running JVM, one can see that the thread
> spawning the repair process was waiting on a lock that was never
> released.
> * This occurred on all nodes, and prevented any manually initiated repair
> process from running. A rolling restart of each node was required, after
> which one could run a `nodetool repair` successfully.
>
> # Cassandra Cluster Details:
> * Cassandra 2.2.5 running on Windows Server 2008 R2
> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>
> # Reaper Details:
> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL
> database.
>
> ## Reaper settings:
> * Parallelism: DC-Aware
> * Repair Intensity: 0.9
> * Incremental: true
>
> Don't want to swamp you with more details or unnecessary logs, especially
> as I'd have to sanitize them before sending them out, so please let me know
> if there is anything else I can provide, and I'll do my best to get it to
> you.
>
> Kind regards,
> Daniel
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com