[ https://issues.apache.org/jira/browse/CASSANDRA-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503736#comment-17503736 ]
Brandon Williams commented on CASSANDRA-17172:
----------------------------------------------
The messages are related, so that sounds like a plausible theory.
> incremental repairs get stuck often
> -----------------------------------
>
> Key: CASSANDRA-17172
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17172
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: James Brown
> Priority: Normal
> Fix For: 4.0.x
>
>
> We're on 4.0.1 and switched to incremental repairs shortly after upgrading to
> 4.0.x. They work fine about 95% of the time, but once in a while a session
> will get stuck and has to be cancelled (with
> `nodetool repair_admin cancel -s <uuid>`). Typically the session will be in
> REPAIRING but nothing will actually be happening.
> Output of nodetool repair_admin:
> {code}
> $ nodetool repair_admin
> id                                   | state     | last activity | coordinator                          | participants | participants_wp
> 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 | REPAIRING | 6771 (s)      | /[fd00:ea51:d057:200:1:0:0:8e]:25472 | fd00:ea51:d057:200:1:0:0:8e,fd00:ea51:d057:200:1:0:0:8f,fd00:ea51:d057:200:1:0:0:92,fd00:ea51:d057:100:1:0:0:571,fd00:ea51:d057:100:1:0:0:570,fd00:ea51:d057:200:1:0:0:93,fd00:ea51:d057:100:1:0:0:573,fd00:ea51:d057:200:1:0:0:90,fd00:ea51:d057:200:1:0:0:91,fd00:ea51:d057:100:1:0:0:572,fd00:ea51:d057:100:1:0:0:575,fd00:ea51:d057:100:1:0:0:574,fd00:ea51:d057:200:1:0:0:94,fd00:ea51:d057:100:1:0:0:577,fd00:ea51:d057:200:1:0:0:95,fd00:ea51:d057:100:1:0:0:576 | [fd00:ea51:d057:200:1:0:0:8e]:25472,[fd00:ea51:d057:200:1:0:0:8f]:25472,[fd00:ea51:d057:200:1:0:0:92]:25472,[fd00:ea51:d057:100:1:0:0:571]:25472,[fd00:ea51:d057:100:1:0:0:570]:25472,[fd00:ea51:d057:200:1:0:0:93]:25472,[fd00:ea51:d057:100:1:0:0:573]:25472,[fd00:ea51:d057:200:1:0:0:90]:25472,[fd00:ea51:d057:200:1:0:0:91]:25472,[fd00:ea51:d057:100:1:0:0:572]:25472,[fd00:ea51:d057:100:1:0:0:575]:25472,[fd00:ea51:d057:100:1:0:0:574]:25472,[fd00:ea51:d057:200:1:0:0:94]:25472,[fd00:ea51:d057:100:1:0:0:577]:25472,[fd00:ea51:d057:200:1:0:0:95]:25472,[fd00:ea51:d057:100:1:0:0:576]:25472
> {code}
> Running `jstack` on the coordinator shows two repair threads, both idle:
> {code}
> "Repair#167:1" #602177 daemon prio=5 os_prio=0 cpu=9.60ms elapsed=57359.81s
> tid=0x00007fa6d1741800 nid=0x18e6c waiting on condition [0x00007fc529f9a000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> - parking to wait for <0x000000045ba93a18> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
> at
> java.util.concurrent.LinkedBlockingQueue.poll([email protected]/LinkedBlockingQueue.java:458)
> at
> java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1053)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1114)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run([email protected]/Thread.java:829)
> "Repair#170:1" #654814 daemon prio=5 os_prio=0 cpu=9.62ms elapsed=7369.98s
> tid=0x00007fa6aec09000 nid=0x1a96f waiting on condition [0x00007fc535aae000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> - parking to wait for <0x00000004c45bf7d8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
> at
> java.util.concurrent.LinkedBlockingQueue.poll([email protected]/LinkedBlockingQueue.java:458)
> at
> java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1053)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1114)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run([email protected]/Thread.java:829)
> {code}
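> (For reference, those traces can be isolated with something like the following; the PID and frame count are illustrative:)
> {code}
> # dump all threads and show just the repair pool workers with their frames
> $ jstack <cassandra-pid> | grep -A 14 'Repair#'
> {code}
> Both threads are parked in ThreadPoolExecutor.getTask(), i.e. idle pool workers waiting for new tasks, not threads blocked mid-repair.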
> nodetool netstats says there is nothing happening:
> {code}
> $ nodetool netstats | head -n 2
> Mode: NORMAL
> Not sending any streams.
> {code}
> There's nothing interesting in the logs for this repair; the last relevant
> thing was a bunch of "Created 0 sync tasks based on 6 merkle tree responses
> for 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 (took: 0ms)" and then back and forth
> for the last couple of hours with things like
> {code}
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO [OptionalTasks:1] LocalSessions.java:938 - Attempting to learn the outcome of unfinished local incremental repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035
> 2021-11-26T21:33:20Z cassandra10nuq 129529 | INFO [AntiEntropyStage:1] LocalSessions.java:987 - Received StatusResponse for repair session 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 with state REPAIRING, which is not actionable. Doing nothing.
> {code}
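> (These status checks just repeat indefinitely; grepping the system log for the session id shows them, assuming the default log location:)
> {code}
> # show the most recent log lines mentioning the stuck session
> $ grep 3a059b10-4ef6-11ec-925f-8f7bcf0ba035 /var/log/cassandra/system.log | tail
> {code}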
> Typically, cancelling the session and rerunning with the exact same command
> line will succeed.
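> For illustration, the recovery sequence looks roughly like this (the keyspace name is a placeholder; on 4.0, `nodetool repair` without `--full` runs an incremental repair):
> {code}
> # confirm the session is wedged: state REPAIRING, no streams moving
> $ nodetool repair_admin
> $ nodetool netstats | head -n 2
> # cancel the stuck session by its id
> $ nodetool repair_admin cancel -s 3a059b10-4ef6-11ec-925f-8f7bcf0ba035
> # rerun the same repair; it usually completes
> $ nodetool repair <keyspace>
> {code}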
> I haven't witnessed this behavior in our testing cluster; it seems to only
> happen on biggish keyspaces.
> I know there have historically been issues with this, which is why tools
> like Reaper kill any repair that runs longer than a short period of time, but
> I also thought those issues were expected to have been fixed in 4.0.