Robert,

You can restart them in any order; that doesn't make a difference afaik.
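Whatever order is chosen, each restart should be followed by a check that the node is back and the ring is healthy before moving on. A minimal sketch of that check (host names and the sample `nodetool status` output below are illustrative, not taken from this thread):

```shell
# Count nodes that are NOT in state UN (Up/Normal), given `nodetool status`
# output on stdin. Status lines start with a two-letter code such as UN or DN.
count_not_up_normal() {
  awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { n++ } END { print n+0 }'
}

# Canned output resembling what a 3-node cluster with one node still down
# might print (made-up addresses and IDs):
sample_status='Datacenter: dc1
==============
Status=Up/Down, State=Normal/Leaving/Joining/Moving
--  Address      Load    Tokens  Owns  Host ID  Rack
UN  10.0.0.1     1.2 GB  256     ?     aaaa     r1
UN  10.0.0.2     1.1 GB  256     ?     bbbb     r1
DN  10.0.0.3     1.3 GB  256     ?     cccc     r1'

printf '%s\n' "$sample_status" | count_not_up_normal   # one node still down

# In a real rolling restart you would, per node (hypothetical host variable):
#   restart cassandra on "$host", then poll:
#   nodetool status | count_not_up_normal   # wait until this prints 0
```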
Cheers

On Wed, Sep 28, 2016 at 17:10, Robert Sicoie <robert.sic...@gmail.com> wrote:

> Thanks Alexander,
>
> Yes, with tpstats I can see the hanging active repair(s) (output
> attached). For one there are 31 pending repairs. On the others there are
> fewer pending repairs (min 12). Is there any recommendation for the
> restart order? The ones with fewer pending repairs first, perhaps?
>
> Thanks,
> Robert
>
> On Wed, Sep 28, 2016 at 5:35 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>
>> They will show up in nodetool compactionstats:
>> https://issues.apache.org/jira/browse/CASSANDRA-9098
>>
>> Did you check nodetool tpstats to see if you didn't have any running
>> repair session? Just to make sure (and if you can actually do it), roll
>> restart the cluster and try again. Repair sessions can get sticky sometimes.
>>
>> On Wed, Sep 28, 2016 at 4:23 PM Robert Sicoie <robert.sic...@gmail.com> wrote:
>>
>>> I am using nodetool compactionstats to check for pending compactions,
>>> and it shows 0 pending on all nodes seconds before running nodetool
>>> repair. I am also monitoring PendingCompactions over JMX.
>>>
>>> Is there any other way I can find out whether any anticompaction is
>>> running on any node?
>>>
>>> Thanks a lot,
>>> Robert
>>>
>>> On Wed, Sep 28, 2016 at 4:44 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>>>
>>>> Robert,
>>>>
>>>> you need to make sure you have no repair session currently running on
>>>> your cluster, and no anticompaction.
>>>> I'd recommend doing a rolling restart in order to stop all running
>>>> repairs for sure, then start the process again, node by node, checking
>>>> that no anticompaction is running before moving from one node to the next.
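The tpstats check described above can be scripted. A sketch that sums the Active and Pending columns of the repair-related thread pools from `nodetool tpstats` output (the sample output below is made up; the pool names are the usual Cassandra 3.x ones):

```shell
# Sum Active + Pending tasks across repair-related thread pools.
# A non-zero result means repair/validation work is still outstanding.
repair_tasks() {
  awk '$1 ~ /^(AntiEntropyStage|Repair#|ValidationExecutor)/ { sum += $2 + $3 } END { print sum+0 }'
}

# Illustrative tpstats excerpt (numbers invented, layout as in 3.x):
sample_tpstats='Pool Name                    Active   Pending      Completed   Blocked
ReadStage                         0         0        1029384         0
MutationStage                     0         0        2938475         0
AntiEntropyStage                  1        31          11223         0
ValidationExecutor                1         0            412         0'

printf '%s\n' "$sample_tpstats" | repair_tasks   # 33 outstanding repair tasks

# Against a live node (hypothetical host variable):
#   nodetool -h "$host" tpstats | repair_tasks
```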
>>>>
>>>> Please do not use the -pr switch, as it is both useless (token ranges
>>>> are repaired only once with inc repair, whatever the replication factor)
>>>> and harmful, since not all anticompactions will be executed (you'll still
>>>> have sstables marked as unrepaired even if the process has run entirely
>>>> with no error).
>>>>
>>>> Let us know how that goes.
>>>>
>>>> Cheers,
>>>>
>>>> On Wed, Sep 28, 2016 at 2:57 PM Robert Sicoie <robert.sic...@gmail.com> wrote:
>>>>
>>>>> Thanks Alexander,
>>>>>
>>>>> Now I started to run the repair with the -pr arg and with keyspace and
>>>>> table args.
>>>>> Still, I got the "ERROR [RepairJobTask:1] 2016-09-28 11:34:38,288
>>>>> RepairRunnable.java:246 - Repair session
>>>>> 89af4d10-856f-11e6-b28f-df99132d7979 for range
>>>>> [(8323429577695061526,8326640819362122791],
>>>>> ..., (4212695343340915405,4229348077081465596]]] Validation failed in
>>>>> /10.45.113.88"
>>>>>
>>>>> for one of the tables. 10.45.113.88 is the IP of the machine I am
>>>>> running nodetool on.
>>>>> I'm wondering if this is normal...
>>>>>
>>>>> Thanks,
>>>>> Robert
>>>>>
>>>>> On Wed, Sep 28, 2016 at 11:53 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> nodetool scrub won't help here, as what you're experiencing is most
>>>>>> likely that one SSTable is going through anticompaction, and then another
>>>>>> node is asking for a Merkle tree that involves it.
>>>>>> For understandable reasons, an SSTable cannot be anticompacted and
>>>>>> validation compacted at the same time.
>>>>>>
>>>>>> The solution here is to adjust the repair pressure on your cluster so
>>>>>> that anticompaction can end before you run repair on another node.
>>>>>> You may have a lot of anticompaction to do if you had high volumes of
>>>>>> unrepaired data, which can take a long time depending on several factors.
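The node-by-node discipline described in this thread (wait out anticompaction, then repair without -pr) can be sketched as a script; the sample `compactionstats` output, host variable, and keyspace name below are illustrative assumptions, not from the thread:

```shell
# Detect anticompaction in `nodetool compactionstats` output on stdin.
# Anticompaction shows up as a compaction of type "Anticompaction after repair".
anticompactions_running() {
  grep -ci 'anticompaction' || true
}

# Illustrative compactionstats excerpt (numbers invented):
sample_compactionstats='pending tasks: 1
   compaction type               keyspace  table            completed  total   unit   progress
   Anticompaction after repair   notes     operator_source  123456     789012  bytes  15.64%'

printf '%s\n' "$sample_compactionstats" | anticompactions_running   # prints 1

# In the real loop, per node (hypothetical names):
#   while [ "$(nodetool -h "$host" compactionstats | anticompactions_running)" -gt 0 ]; do
#     sleep 60   # let anticompaction finish before the next session
#   done
#   nodetool -h "$host" repair my_keyspace   # incremental repair; note: no -pr
```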
>>>>>>
>>>>>> You can tune your repair process to make sure no anticompaction is
>>>>>> running before launching a new session on another node, or you can try my
>>>>>> Reaper fork that handles incremental repair:
>>>>>> https://github.com/adejanovski/cassandra-reaper/tree/inc-repair-support-with-ui
>>>>>> I may have to add a few checks in order to avoid all collisions
>>>>>> between anticompactions and new sessions, but it should be helpful if you
>>>>>> struggle with incremental repair.
>>>>>>
>>>>>> In any case, check whether your nodes are still anticompacting before
>>>>>> trying to run a new repair session on a node.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> On Wed, Sep 28, 2016 at 10:31 AM Robert Sicoie <robert.sic...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> I have a cluster of 5 nodes, cassandra 3.0.5.
>>>>>>> I was running nodetool repair over the last few days, one node at a
>>>>>>> time, when I first encountered this exception:
>>>>>>>
>>>>>>> ERROR [ValidationExecutor:11] 2016-09-27 16:12:20,409 CassandraDaemon.java:195 - Exception in thread Thread[ValidationExecutor:11,1,main]
>>>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>>>   at org.apache.cassandra.db.compaction.CompactionManager.getSSTablesToValidate(CompactionManager.java:1194) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1084) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.db.compaction.CompactionManager.access$700(CompactionManager.java:80) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:714) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>>>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>>>>
>>>>>>> On some of the other boxes I see this:
>>>>>>>
>>>>>>> Caused by: org.apache.cassandra.exceptions.RepairException: [repair #9dd21ab0-83f4-11e6-b28f-df99132d7979 on notes/operator_source_mv, [(-7505573573695693981,-7495786486761919991],
>>>>>>> ....
>>>>>>> (-8483612809930827919,-8480482504800860871]]] Validation failed in /10.45.113.67
>>>>>>>   at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:408) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>>>>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
>>>>>>>   at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
>>>>>>> ERROR [RepairJobTask:3] 2016-09-26 16:39:33,096 CassandraDaemon.java:195 - Exception in thread Thread[RepairJobTask:3,5,RMI Runtime]
>>>>>>> java.lang.AssertionError: java.lang.InterruptedException
>>>>>>>   at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:172) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:761) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:729) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at org.apache.cassandra.repair.ValidationTask.run(ValidationTask.java:56) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_60]
>>>>>>>   at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]
>>>>>>> Caused by: java.lang.InterruptedException: null
>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) ~[na:1.8.0_60]
>>>>>>>   at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) ~[na:1.8.0_60]
>>>>>>>   at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339) ~[na:1.8.0_60]
>>>>>>>   at org.apache.cassandra.net.OutboundTcpConnection.enqueue(OutboundTcpConnection.java:168) ~[apache-cassandra-3.0.5.jar:3.0.5]
>>>>>>>   ... 6 common frames omitted
>>>>>>>
>>>>>>> Now if I run nodetool repair I get the
>>>>>>>
>>>>>>> java.lang.RuntimeException: Cannot start multiple repair sessions over the same sstables
>>>>>>>
>>>>>>> exception.
>>>>>>> What do you suggest? Would nodetool scrub or sstablescrub help in
>>>>>>> this case, or would it just make things worse?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Robert
>>>>>>
>>>>>> --
>>>>>> -----------------
>>>>>> Alexander Dejanovski
>>>>>> France
>>>>>> @alexanderdeja
>>>>>>
>>>>>> Consultant
>>>>>> Apache Cassandra Consulting
>>>>>> http://www.thelastpickle.com