Thanks Alexander,

Yes, with tpstats I can see the hanging active repair(s) (output attached).
On one node there are 31 pending repairs. On the others there are fewer pending
repairs (min 12). Is there any recommendation for the restart order? The one
with the most or the fewest pending repairs first, perhaps?
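
To compare the nodes, the Repair#N rows of each tpstats dump can be summed and the nodes ranked by pending repair tasks. This is only a minimal sketch: the helper names (repair_backlog, restart_order) and the node labels are hypothetical, it assumes each pool row starts with the pool name followed by the Active and Pending columns, and the ascending order is one possible policy, not a confirmed recommendation.

```python
from typing import Dict, List, Tuple


def repair_backlog(tpstats_text: str) -> Tuple[int, int]:
    """Return (active, pending) summed over all Repair#N pools in one tpstats dump."""
    active = pending = 0
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Assumed row layout: <pool name> <active> <pending> <completed> ...
        if len(fields) >= 3 and fields[0].startswith("Repair#"):
            active += int(fields[1])
            pending += int(fields[2])
    return active, pending


def restart_order(dumps: Dict[str, str]) -> List[str]:
    """Order node names by ascending pending repair tasks (one possible policy)."""
    return sorted(dumps, key=lambda node: repair_backlog(dumps[node])[1])


if __name__ == "__main__":
    # Hypothetical node labels; the rows mirror the shape of the attached output.
    sample = {
        "node1": "Repair#4   1   12   4   0   0",
        "node2": "Repair#4   1   14   2   0   0\nRepair#5   1   15   1   0   0",
        "node5": "Repair#1   1   31   1   0   0",
    }
    print(restart_order(sample))
```

Because only the first three whitespace-separated fields of a row are read, the sketch should also cope with dumps where the last column has wrapped onto the next line.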

Thanks,
Robert

Robert Sicoie

On Wed, Sep 28, 2016 at 5:35 PM, Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> They will show up in nodetool compactionstats:
> https://issues.apache.org/jira/browse/CASSANDRA-9098
>
> Did you check nodetool tpstats to make sure you don't have any running
> repair session?
> Just to make sure (and if you can actually do it), do a rolling restart of
> the cluster and try again. Repair sessions can get sticky sometimes.
>
> On Wed, Sep 28, 2016 at 4:23 PM Robert Sicoie <robert.sic...@gmail.com>
> wrote:
>
>> I am using nodetool compactionstats to check for pending compactions, and
>> it shows 0 pending on all nodes seconds before I run nodetool repair.
>> I am also monitoring PendingCompactions over JMX.
>>
>> Is there any other way I can find out whether an anticompaction is running
>> on any node?
>>
>> Thanks a lot,
>> Robert
>>
>> Robert Sicoie
>>
>> On Wed, Sep 28, 2016 at 4:44 PM, Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Robert,
>>>
>>> you need to make sure you have no repair session currently running on
>>> your cluster, and no anticompaction.
>>> I'd recommend doing a rolling restart to make sure all running repairs
>>> are stopped, then starting the process again, node by node, checking that
>>> no anticompaction is running before moving from one node to the next.
>>>
>>> Please do not use the -pr switch, as it is both useless (token ranges are
>>> repaired only once with incremental repair, whatever the replication factor)
>>> and harmful, since not all anticompactions will be executed (you'll still
>>> have sstables marked as unrepaired even if the process has run entirely
>>> with no error).
>>>
>>> Let us know how that goes.
>>>
>>> Cheers,
>>>
>>> On Wed, Sep 28, 2016 at 2:57 PM Robert Sicoie <robert.sic...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Alexander,
>>>>
>>>> I have now started running the repair with the -pr arg and with keyspace
>>>> and table args.
>>>> Still, I got the following error for one of the tables:
>>>>
>>>> ERROR [RepairJobTask:1] 2016-09-28 11:34:38,288 RepairRunnable.java:246 - Repair session 89af4d10-856f-11e6-b28f-df99132d7979 for range [(8323429577695061526,8326640819362122791], ..., (4212695343340915405,4229348077081465596]]] Validation failed in /10.45.113.88
>>>>
>>>> 10.45.113.88 is the IP of the machine I am running nodetool on.
>>>> I'm wondering if this is normal...
>>>>
>>>> Thanks,
>>>> Robert
>>>>
>>>>
>>>>
>>>>
>>>> Robert Sicoie
>>>>
>>>> On Wed, Sep 28, 2016 at 11:53 AM, Alexander Dejanovski <
>>>> a...@thelastpickle.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> nodetool scrub won't help here, as what you're experiencing is most
>>>>> likely that one SSTable is going through anticompaction, and then another
>>>>> node is asking for a Merkle tree that involves it.
>>>>> For understandable reasons, an SSTable cannot be anticompacted and
>>>>> validation compacted at the same time.
>>>>>
>>>>> The solution here is to adjust the repair pressure on your cluster so
>>>>> that anticompaction can end before you run repair on another node.
>>>>> You may have a lot of anticompaction to do if you had high volumes of
>>>>> unrepaired data, which can take a long time depending on several factors.
>>>>>
>>>>> You can tune your repair process to make sure no anticompaction is
>>>>> running before launching a new session on another node, or you can try my
>>>>> Reaper fork that handles incremental repair:
>>>>> https://github.com/adejanovski/cassandra-reaper/tree/inc-repair-support-with-ui
>>>>> I may have to add a few checks in order to avoid all collisions
>>>>> between anticompactions and new sessions, but it should be helpful if you
>>>>> struggle with incremental repair.
>>>>>
>>>>> In any case, check if your nodes are still anticompacting before
>>>>> trying to run a new repair session on a node.
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> On Wed, Sep 28, 2016 at 10:31 AM Robert Sicoie <
>>>>> robert.sic...@gmail.com> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I have a cluster of 5 nodes running Cassandra 3.0.5.
>>>>>> I have been running nodetool repair over the last few days, one node at
>>>>>> a time, when I first encountered this exception:
>>>>>>
>>>>>> ERROR [ValidationExecutor:11] 2016-09-27 16:12:20,409 CassandraDaemon.java:195 - Exception in thread Thread[ValidationExecutor:11,1,main]
>>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1194) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1084) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:80) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:714) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>>>
>>>>>> On some of the other boxes I see this:
>>>>>>
>>>>>>
>>>>>> Caused by: org.apache.cassandra.exceptions.RepairException: [repair #9dd21ab0-83f4-11e6-b28f-df99132d7979 on notes/operator_source_mv, [(-7505573573695693981,-7495786486761919991],
>>>>>> ....
>>>>>> (-8483612809930827919,-8480482504800860871]]] Validation failed in /10.45.113.67
>>>>>>     at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:408) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>>> ERROR [RepairJobTask:3] 2016-09-26 16:39:33,096 CassandraDaemon.java:195 - Exception in thread Thread[RepairJobTask:3,5,RMI Runtime]
>>>>>> java.lang.AssertionError: java.lang.InterruptedException
>>>>>>     at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:172) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:761) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:729) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at org.apache.cassandra.repair.ValidationTask.run(ValidationTask.java:56) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_60]
>>>>>>     at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]
>>>>>> Caused by: java.lang.InterruptedException: null
>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) ~[na:1.8.0_60]
>>>>>>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) ~[na:1.8.0_60]
>>>>>>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) ~[na:1.8.0_60]
>>>>>>     at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>     ... 6 common frames omitted
>>>>>>
>>>>>>
>>>>>> Now if I run nodetool repair I get the
>>>>>>
>>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>>
>>>>>> exception.
>>>>>> What do you suggest? Would nodetool scrub or sstablescrub help in
>>>>>> this case, or would it just make things worse?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Robert
>>>>>>
>>>>> --
>>>>> -----------------
>>>>> Alexander Dejanovski
>>>>> France
>>>>> @alexanderdeja
>>>>>
>>>>> Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com
>>>>>
>>>>
>>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>
>> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
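
The check suggested in the quoted thread (no anticompaction still running before moving on to the next node) could be scripted roughly as follows. This is a sketch under stated assumptions: it takes already-captured nodetool compactionstats text per node and simply looks for the word "anticompaction" in it; the function names, node labels and sample dump lines are hypothetical, and the exact row layout of real output may differ.

```python
from typing import Dict, List


def anticompaction_running(compactionstats_text: str) -> bool:
    """Return True if any line of a compactionstats dump mentions anticompaction."""
    return any(
        "anticompaction" in line.lower()
        for line in compactionstats_text.splitlines()
    )


def nodes_still_busy(dumps: Dict[str, str]) -> List[str]:
    """List the nodes whose dump still shows an anticompaction, in sorted order."""
    return sorted(node for node, text in dumps.items() if anticompaction_running(text))


if __name__ == "__main__":
    # Hypothetical dumps, for illustration only.
    dumps = {
        "node1": "pending tasks: 0",
        "node2": "pending tasks: 1\nAnticompaction after repair  notes  operator_source_mv",
    }
    print(nodes_still_busy(dumps))
```

A new repair session would then only be started once nodes_still_busy returns an empty list.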
NODE 1:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0       10319975         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0        1339254         0                 0
RequestResponseStage              0         0       40531035         0                 0
ReadRepairStage                   0         0              5         0                 0
CounterMutationStage              0         0              0         0                 0
Repair#4                          1        12              4         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0         303842         0                 0
MemtableReclaimMemory             0         0           1252         0                 0
PendingRangeCalculator            0         0              9         0                 0
GossipStage                       0         0        1901430         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0              0         0                 0
MigrationStage                    0         0              5         0                 0
MemtablePostFlush                 0         0           1427         0                 0
ValidationExecutor                0         0             75         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           1252         0                 0
InternalResponseStage             0         0             46         0                 0
AntiEntropyStage                  0         0            245         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         0         0           1902         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

NODE 2:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0       44317382         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0        2783259         0                 0
RequestResponseStage              0         0        6218344         0                 0
ReadRepairStage                   0         0         231401         0                 0
CounterMutationStage              0         0              0         0                 0
Repair#4                          1        14              2         0                 0
Repair#5                          1        15              1         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0         303728         0                 0
MemtableReclaimMemory             0         0           1503         0                 0
PendingRangeCalculator            0         0              8         0                 0
GossipStage                       0         0        1898890         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0              1         0                 0
MigrationStage                    0         0              6         0                 0
MemtablePostFlush                 0         0           1684         0                 0
ValidationExecutor                0         0             78         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           1503         0                 0
InternalResponseStage             0         0             28         0                 0
AntiEntropyStage                  0         0            246         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         0         0        9647628         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

NODE 3:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0       47318571         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0        2628095         0                 0
RequestResponseStage              0         0        5950989         0                 0
ReadRepairStage                   0         0         215773         0                 0
CounterMutationStage              0         0              0         0                 0
Repair#4                          1        15              1         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0         303587         0                 0
MemtableReclaimMemory             0         0           1337         0                 0
PendingRangeCalculator            0         0              8         0                 0
GossipStage                       0         0        1684612         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0              0         0                 0
MigrationStage                    0         0              6         0                 0
MemtablePostFlush                 0         0           1521         0                 0
ValidationExecutor                0         0             71         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           1337         0                 0
InternalResponseStage             0         0             14         0                 0
AntiEntropyStage                  0         0            226         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         1         0        8830514         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

NODE 4:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        3911326         0                 0
ViewMutationStage                 0         0             20         0                 0
ReadStage                         0         0        2242116         0                 0
RequestResponseStage              0         0       24507251         0                 0
ReadRepairStage                   0         0         207587         0                 0
CounterMutationStage              0         0              0         0                 0
Repair#1                          1        23              9         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0         289034         0                 0
MemtableReclaimMemory             0         0            780         0                 0
PendingRangeCalculator            0         0              7         0                 0
GossipStage                       0         0        1404078         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0              0         0                 0
MigrationStage                    0         0              9         0                 0
MemtablePostFlush                 0         0            867         0                 0
ValidationExecutor                0         0             12         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0            780         0                 0
InternalResponseStage             0         0             10         0                 0
AntiEntropyStage                  0         0             43         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         0         0        8579499         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

NODE 5:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0        4013686         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0        2202688         0                 0
RequestResponseStage              0         0       23439680         0                 0
ReadRepairStage                   0         0         204055         0                 0
CounterMutationStage              0         0              0         0                 0
Repair#1                          1        31              1         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                0         0         288627         0                 0
MemtableReclaimMemory             0         0            782         0                 0
PendingRangeCalculator            0         0              7         0                 0
GossipStage                       0         0        1401350         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0              0         0                 0
MigrationStage                    0         0              9         0                 0
MemtablePostFlush                 0         0            842         0                 0
ValidationExecutor                0         0             10         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0            782         0                 0
InternalResponseStage             0         0             13         0                 0
AntiEntropyStage                  0         0             23         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Native-Transport-Requests         0         0        8385155         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0



