[
https://issues.apache.org/jira/browse/CASSANDRA-13555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029421#comment-16029421
]
T Jake Luciani commented on CASSANDRA-13555:
--------------------------------------------
The issue here is we aren't finishing off the validations when the repair
session terminates:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairSession.java#L302
So just need to walk that array and set results to null. Same with the
syncTasks.
[~szhou] would you be able to make a patch (and ideally a dtest)?
> Thread leak during repair
> -------------------------
>
> Key: CASSANDRA-13555
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13555
> Project: Cassandra
> Issue Type: Bug
> Reporter: Simon Zhou
> Assignee: Simon Zhou
>
> The symptom is similar to what happened in [CASSANDRA-13204 |
> https://issues.apache.org/jira/browse/CASSANDRA-13204] that the thread
> waiting forever doing nothing. This one happened during "nodetool repair -pr
> -seq -j 1" in production but I can easily simulate the problem with just
> "nodetool repair" in dev environment (CCM). I'm trying to explain what
> happened with 3.0.13 code base.
> 1. One node is down while doing repair. This is the error I saw in production:
> {code}
> ERROR [GossipTasks:1] 2017-05-19 15:00:10,545 RepairSession.java:334 -
> [repair #bc9a3cd1-3ca3-11e7-a44a-e30923ac9336] session completed with the
> following error
> java.io.IOException: Endpoint /10.185.43.15 died
> at
> org.apache.cassandra.repair.RepairSession.convict(RepairSession.java:333)
> ~[apache-cassandra-3.0.11.jar:3.0.11]
> at
> org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:306)
> [apache-cassandra-3.0.11.jar:3.0.11]
> at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:766)
> [apache-cassandra-3.0.11.jar:3.0.11]
> at org.apache.cassandra.gms.Gossiper.access$800(Gossiper.java:66)
> [apache-cassandra-3.0.11.jar:3.0.11]
> at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:181)
> [apache-cassandra-3.0.11.jar:3.0.11]
> at
> org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118)
> [apache-cassandra-3.0.11.jar:3.0.11]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [na:1.8.0_121]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> [na:1.8.0_121]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> [na:1.8.0_121]
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> [na:1.8.0_121]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_121]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_121]
> at
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
> [apache-cassandra-3.0.11.jar:3.0.11]
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_121]
> {code}
> 2. At this moment the repair coordinator hasn't received the response
> (MerkleTrees) for the node that was marked down. This means, RepairJob#run
> will never return because it waits for validations to finish:
> {code}
> // Wait for validation to complete
> Futures.getUnchecked(validations);
> {code}
> Be noted that all RepairJob's (as Runnable) run on a shared executor created
> in RepairRunnable#runMayThrow, while all snapshot, validation and sync'ing
> happen on a per-RepairSession "taskExecutor". The RepairJob#run will only
> return when it receives MerkleTrees (or null) from all endpoints for a given
> column family and token range.
> As evidence of the thread leak, below is from the thread dump. I can also get
> the same stack trace when simulating the same issue in dev environment.
> {code}
> "Repair#129:56" #406373 daemon prio=5 os_prio=0 tid=0x00007fc495028400
> nid=0x1a77d waiting on condition [0x00007fc021530000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000002d7c00198> (a
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
> at
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$4/725832346.run(Unknown
> Source)
> at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
> - <0x00000002d7c00230> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
> So here are two things:
> 1. For the thread leak itself, either we do something like below in
> RepairSession#terminate, or we use timed wait at the end of RepairJob#run.
> {code}
> for (ValidationTask validationTask : validating.values()) {
> validationTask.treesReceived(null);
> }
> validating.clear();
> {code}
> 2. Another question is, instead of waiting for synchronization (SyncTask) to
> finish, why we just wait for validation? Is it because we want to speed
> things up and anyway we have throttling on streaming?
> [~yukim] I'd love to get your comment. I'll check if this issue exists in
> other versions.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]