[
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206250#comment-17206250
]
Alexander Dejanovski commented on CASSANDRA-15902:
--------------------------------------------------
So far, so good.
I've reproduced the issue in 3.11 using a low timeout in Reaper and repair
sessions started to pile up indefinitely:
{code:java}
% x_all "sudo su -s /bin/bash -c \"jstack \$(ps -ef |grep CassandraDaemon |grep -v grep| cut -d' ' -f3) |grep 'Repair#'\" cassandra"
"Repair#11:1" #2193 daemon prio=5 os_prio=0 tid=0x00007fe15b19f530 nid=0x74d8 waiting on condition [0x00007fe145968000]
"Repair#10:1" #2154 daemon prio=5 os_prio=0 tid=0x00007fe16d7eceb0 nid=0x7471 waiting on condition [0x00007fe12bf12000]
"Repair#8:1" #2116 daemon prio=5 os_prio=0 tid=0x00007fe150316b40 nid=0x73f1 waiting on condition [0x00007fe12ce09000]
"Repair#7:1" #2084 daemon prio=5 os_prio=0 tid=0x00007fe150162f80 nid=0x73a9 waiting on condition [0x00007fe137894000]
"Repair#3:1" #1704 daemon prio=5 os_prio=0 tid=0x00007fe10f1b98d0 nid=0x6b9a waiting on condition [0x00007fe1428fc000]
"Repair#14:1" #1778 daemon prio=5 os_prio=0 tid=0x0000565030775bb0 nid=0x6d58 waiting on condition [0x00007f8d08659000]
"Repair#9:1" #1573 daemon prio=5 os_prio=0 tid=0x00007f8d28770af0 nid=0x6b88 waiting on condition [0x00007f8d1ff39000]
"Repair#2:1" #1397 daemon prio=5 os_prio=0 tid=0x00007f8d2815eb70 nid=0x6851 waiting on condition [0x00007f8d1f9a0000]
"Repair#1:1" #1375 daemon prio=5 os_prio=0 tid=0x00007f8c67dcee40 nid=0x66a8 waiting on condition [0x00007f8d1cc6f000]
"Repair#1:1" #2412 daemon prio=5 os_prio=0 tid=0x00007fc61d2a38f0 nid=0x6ed9 waiting on condition [0x00007fc60736d000]
{code}
Then I built the patched version and waited again for repairs to time out for a
little while.
I never got more than one repair thread:
{code:java}
% x_all "sudo su -s /bin/bash -c \"jstack \$(ps -ef |grep CassandraDaemon |grep -v grep| cut -d' ' -f2) |grep 'Repair#'\" cassandra"
"Repair#21:1" #682 daemon prio=5 os_prio=0 tid=0x00007f249854cc10 nid=0x7ced waiting on condition [0x00007f246f779000]
{code}
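For repeated checks, the grep over the jstack output can be replaced by a small counter. This is only a sketch: the class name {{RepairThreadCount}} is made up for illustration, and it assumes thread header lines look like the ones shown above (a quoted name of the form {{"Repair#<n>:<m>"}} at the start of the line).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cassandra or Reaper.
public class RepairThreadCount {
    // A jstack thread header starts with the quoted thread name, e.g. "Repair#11:1".
    private static final Pattern REPAIR_HEADER =
            Pattern.compile("^\"Repair#\\d+:\\d+\"");

    /** Counts thread header lines named "Repair#<n>:<m>" in jstack output. */
    public static int countRepairThreads(String jstackOutput) {
        int count = 0;
        for (String line : jstackOutput.split("\\R")) {
            if (REPAIR_HEADER.matcher(line).find()) {
                count++;
            }
        }
        return count;
    }

    // Reads a thread dump from stdin and prints the number of repair threads.
    public static void main(String[] args) throws IOException {
        StringBuilder dump = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                dump.append(line).append('\n');
            }
        }
        System.out.println(countRepairThreads(dump.toString()));
    }
}
```

Piping the jstack output of each node into this class gives one number per node, which makes a count that keeps growing between runs easy to spot.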
I'm currently checking that repairs still go through as expected with a regular
timeout; that test is still running.
Once that's done, I'll check again against 3.0 and then perform a code review.
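For context, the hang described in the ticket (a thread waiting on a future that is never completed once its executor is stopped) can be illustrated with plain JDK executors. This is not Cassandra's repair code, only an analogous sketch: the class name {{AbandonedFutureDemo}} and the task bodies are invented for illustration, and a timed {{get}} stands in for the unbounded wait so that the example terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class, unrelated to the actual Cassandra sources.
public class AbandonedFutureDemo {
    /**
     * Returns true if waiting on a task that was still queued when the
     * executor was stopped times out, i.e. its future was never completed.
     */
    public static boolean waiterHangs(long waitMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch never = new CountDownLatch(1);

        // The first task occupies the only worker, so the second stays queued.
        pool.submit(() -> { never.await(); return null; });
        Future<String> queued = pool.submit(() -> "validated");

        // shutdownNow() interrupts the running task and removes queued tasks
        // from the queue; it neither runs nor cancels the removed tasks.
        pool.shutdownNow();

        try {
            queued.get(waitMillis, TimeUnit.MILLISECONDS);
            return false; // the future completed after all
        } catch (TimeoutException e) {
            // An untimed get() here would block forever.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("waiter would hang: " + waiterHangs(500));
    }
}
```

A waiter that uses an untimed, uninterruptible wait on such a future has no way to notice that the executor is gone, which matches the parked Repair# threads in the dumps.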
> OOM because repair session thread not closed when terminating repair
> --------------------------------------------------------------------
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: Swen Fuhrmann
> Assignee: Swen Fuhrmann
> Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, some nodes slowly run out of memory after a while. On
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} once they
> reach the timeout of 30 min.
> In the heap dump we see that a lot of instances of
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by
> "sun.misc.Launcher$AppClassLoader @ 0x51a800000" occupy 8.445.684.480 (93,96
> %) bytes. {noformat}
> In the thread dump we see a lot of repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
> 50 {noformat}
>
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x0000000012fc5000
> nid=0x542a waiting on condition [0x00007f81ee414000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000007939bcfc8> (a com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>
> That's the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops
> the thread pool executor. It looks like futures that are in progress
> will therefore never be completed, so the repair thread waits forever and
> never finishes.
>
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> On another cluster with the same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode)
> {noformat}
>
> The same issue is described in this comment:
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments I created this new specific ticket.