[ https://issues.apache.org/jira/browse/CASSANDRA-13885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173049#comment-16173049 ]

Thomas Steinmaurer edited comment on CASSANDRA-13885 at 9/20/17 11:48 AM:
--------------------------------------------------------------------------

It is about easing the operational side: 2.2+ is a major shift towards 
different and much more complex behavior, when all I want is to run a daily 
full repair across my 9-node cluster on 2 small-volume CFs 
(grace period = 72h). With 2.1 I was used to doing that by running the 
following, kicked off in parallel on all nodes:
{code}
nodetool repair -pr mykeyspace mycf1 mycf2
{code}
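For context, a minimal sketch of the kind of daily cron entry this maps to (the schedule and log path are my own illustrative placeholders, not from the actual setup):
{code}
# Hypothetical crontab entry on each of the 9 nodes: kick off the daily
# partitioner-range repair of the two small CFs and capture the output.
# Schedule and log path are placeholders.
0 2 * * * /usr/bin/nodetool repair -pr mykeyspace mycf1 mycf2 >> /var/log/cassandra/repair-cron.log 2>&1
{code}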
OK, I learned that incremental repair is the default since 2.2+, so I need to 
additionally pass the -full option. Not a big deal, but when running the 
following with 3.0.14, again kicked off in parallel on all nodes:
{code}
nodetool repair -full -pr mykeyspace mycf1 mycf2
{code}
I start to see nodetool output like the following:
{code}
...
[2017-09-20 11:34:49,968] Some repair failed
[2017-09-20 11:34:49,968] Repair command #8 finished in 0 seconds
error: Repair job has failed with the error message: [2017-09-20 11:34:49,968] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2017-09-20 11:34:49,968] Some repair failed
        at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
{code}
With corresponding entries in the Cassandra log:
{noformat}
...
6084592,-2610211481793768452], (280506507907773715,302389115279520703], 
(-5974981857606828384,-5962141498717352776], 
(6642604399479339844,6664596384716805222], 
(3176178340546590823,3182242320217954219], 
(6534347373256357699,6534785652363368819], 
(-3756238465673315474,-3752190783358815211], 
(7139677986395944961,7145455101208653220], 
(-3297144043975661711,-3274612177648431803], 
(5273980670821159743,5281982202791896119], 
(-6128989336346960670,-6080468590993099589], 
(-2173810736498649004,-2131529908597487459], 
(7439773636855937356,7476905072738807852]]] Validation failed in /10.176.38.128
        at org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:178) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:486) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:164) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_102]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_102]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.14.jar:3.0.14]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_102]
INFO  [InternalResponseStage:32] 2017-09-20 11:41:58,054 RepairRunnable.java:337 - Repair command #11 finished in 0 seconds
ERROR [ValidationExecutor:29] 2017-09-20 11:41:58,056 Validator.java:268 - Failed creating a merkle tree for [repair #b53b44a0-9df8-11e7-916c-a5c15f10854d on ruxitdb/Me2Data, [(-9036672081060178828,-9030154922268771156], 
(1469740174912727009,1543926123757478678], 
(8863036841963129257,8867114458641555677], 
(-2610211481793768452,-2603133469451342452], 
(-5434810958758711978,-5401236033897257975], 
(5446456273884963354,5512385756828046297], 
(-5733849916893192315,-5651354489457211297], 
(5579261856873396905,5629665914232130557], 
(-3661618321040339655,-3653143301436649195], 
(-3344525143879048394,-3314190367243835481], 
(2113416595214497156,2140252649319845130], 
(-186804760253388038,-136455684914788326], 
(130823363710141924,188931062065209030], 
(229372617650564758,256901816244047153], 
(-3460004924864535758,-3448189173914847013], 
(7667789006793829873,7672435884237063221], 
(-5401236033897257975,-5371782704264523053], 
(-3829469150597291433,-3823438964996675746], 
(8833078706147578756,8850650250670324319], 
(5112280378866264088,5193085768303122438], 
(4155723864378803139,4171414017862833361], 
(-840951991332283834,-820389464184628689], 
(-8599778977804844748,-8579712223690479957], 
(6900678321423523623,6900784348977090766], 
(-7453077334586977466,-7449408715037121306], 
(1703184128556034757,1708159674820812561], 
(772306949709931532,799988896726778408], (-5294307699953409870,-52800750682
...
{noformat}
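The practical consequence is that the daily job can no longer simply be fired on all nodes at once: to avoid the overlapping validation failures above, the repairs effectively have to be serialized across the ring. A rough sketch of what that orchestration looks like, assuming SSH access from an admin host and hypothetical node names:
{code}
#!/bin/sh
# Hypothetical workaround: run the full repair node by node instead of in
# parallel, so validations / anti-compactions never overlap.
# Node names are placeholders; error handling is deliberately minimal.
for node in node1 node2 node3 node4 node5 node6 node7 node8 node9; do
    echo "Repairing via $node ..."
    ssh "$node" nodetool repair -full -pr mykeyspace mycf1 mycf2 || {
        echo "Repair failed on $node" >&2
        exit 1
    }
done
{code}
That is exactly the kind of extra orchestration 2.1 never required for this use case.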


> Allow to run full repairs in 3.0 without additional cost of anti-compaction
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13885
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13885
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Thomas Steinmaurer
>
> This ticket is basically the result of the discussion on the Cassandra user list: 
> https://www.mail-archive.com/user@cassandra.apache.org/msg53562.html
> I was asked by Paulo Motta to open a ticket about back-porting the ability 
> to run full repairs without the additional cost of anti-compaction.
> Basically there is no way in 3.0 to run full repairs from several nodes 
> concurrently without trouble caused by (overlapping?) anti-compactions. 
> Coming from 2.1 this is a major change from an operational POV, basically 
> breaking e.g. any cron-job-based solution that kicks off -pr repairs on 
> several nodes concurrently.


