Swen Fuhrmann created CASSANDRA-15902:
-----------------------------------------

             Summary: OOM because repair session thread not closed when terminating repair
                 Key: CASSANDRA-15902
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
             Project: Cassandra
          Issue Type: Bug
          Components: Consistency/Repair
            Reporter: Swen Fuhrmann


In our cluster, some nodes slowly run out of memory over time. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} when they reach the 
timeout of 30 min.
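
For anyone trying to reproduce this without Reaper, the same JMX operation can probably be triggered by hand. The following is only a sketch: it assumes the usual {{org.apache.cassandra.db:type=StorageService}} MBean name, an unauthenticated JMX endpoint, and a host and port supplied on the command line (7199 is the common default). The class and helper names are made up for illustration.

```java
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of Cassandra or Reaper.
public class TerminateRepairSessions {

    // Assumed MBean name under which Cassandra registers StorageServiceMBean.
    static ObjectName storageServiceName() throws MalformedObjectNameException {
        return new ObjectName("org.apache.cassandra.db:type=StorageService");
    }

    // Standard RMI-based JMX service URL for the given host and port.
    static JMXServiceURL jmxUrl(String host, int port) throws MalformedURLException {
        return new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    // Connects to the node and invokes the no-argument operation named in this report.
    static void forceTerminateAllRepairSessions(String host, int port) throws Exception {
        try (JMXConnector connector = JMXConnectorFactory.connect(jmxUrl(host, port))) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            mbeans.invoke(storageServiceName(), "forceTerminateAllRepairSessions",
                          new Object[0], new String[0]);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TerminateRepairSessions <host> <jmx-port>");
            return;
        }
        forceTerminateAllRepairSessions(args[0], Integer.parseInt(args[1]));
    }
}
```

Running this against a node while a repair is in its validation phase should be enough to check whether a {{Repair#}} thread is left behind in the next thread dump.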

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
a lot of repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
      50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x0000000012fc5000 nid=0x542a waiting on condition [0x00007f81ee414000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007939bcfc8> (a com.google.common.util.concurrent.AbstractFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
        at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
        at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
        at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
        at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748) {noformat}
 

This is the line ({{RepairJob.java:160}}) where the threads are stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
 

The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the 
thread pool executor. It looks like futures that are still in progress are 
therefore never completed, so the repair thread waits on them forever and 
never finishes.
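
The suspected mechanism can be illustrated outside Cassandra with plain JDK classes. This is a minimal sketch, not the actual repair code: {{StuckRepairThreadDemo}}, the thread name and the never-completed {{validations}} future are stand-ins, and the small {{getUninterruptibly}} helper only mimics the Guava call visible in the stack trace above. It shows that {{shutdownNow()}} interrupts the worker, but an uninterruptible wait swallows the interrupt and keeps blocking until somebody completes the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustration only; names are hypothetical and unrelated to Cassandra's classes.
public class StuckRepairThreadDemo {

    // Mimics an uninterruptible get: interrupts are remembered but do not end the wait.
    static <T> T getUninterruptibly(Future<T> future) throws ExecutionException {
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return future.get();
                } catch (InterruptedException e) {
                    interrupted = true; // swallow and keep waiting
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore the flag on exit
            }
        }
    }

    // Returns true if the worker is still blocked after the executor was shut down.
    static boolean workerStillBlockedAfterShutdownNow() throws Exception {
        // Stand-in for the validation future that nobody completes after termination.
        CompletableFuture<Void> validations = new CompletableFuture<>();
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "Repair#demo:1");
            t.setDaemon(true);
            return t;
        });
        CountDownLatch started = new CountDownLatch(1);
        executor.submit(() -> {
            started.countDown();
            try {
                getUninterruptibly(validations); // blocks like the repair job does
            } catch (ExecutionException e) {
                throw new RuntimeException(e);
            }
        });
        started.await();

        executor.shutdownNow(); // interrupts the worker thread
        boolean terminated = executor.awaitTermination(500, TimeUnit.MILLISECONDS);

        // Clean up the demo: completing the future is the only thing that frees the thread.
        validations.complete(null);
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return !terminated;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("worker still blocked after shutdownNow: "
                + workerStillBlockedAfterShutdownNow());
    }
}
```

In the demo the thread is released only because the future is completed explicitly at the end. If nothing completes or fails the future after the executor is stopped, the waiting thread stays parked, which would match the growing number of {{Repair#}} threads in the dumps.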

 

Environment:

Cassandra version: 3.11.4

Cassandra Reaper: 1.4.0

Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
