[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-15902:
---
Fix Version/s: (was: 4.0-beta)
   4.0-beta3
   4.0

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Fix For: 4.0, 3.0.23, 3.11.9, 4.0-beta3
>
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, some nodes slowly run out of memory after a while. On those 
> nodes we observed that Cassandra Reaper terminates repairs with a JMX call 
> to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they 
> reach the 30 min timeout.
> In the heap dump we see that many instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see many repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> This is the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are in progress will 
> therefore never be completed, so the repair thread waits forever and never 
> finishes.
>  
> Environment:
> Cassandra version: 3.11.4 and 3.11.6
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> On another cluster with the same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue is described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments, I created this new, more specific ticket.
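The hang described above can be reproduced outside Cassandra. The following is 
a minimal, hypothetical sketch using only Guava and java.util.concurrent (it is 
not Cassandra code, and the class name is made up): a task is queued on an 
executor, the executor is shut down before the task runs, and a thread blocked 
in {{Futures.getUnchecked()}} on that task's future then parks forever, just 
like the repair threads in the thread dump above.
{noformat}
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StuckWaiterSketch
{
    public static void main(String[] args)
    {
        ListeningExecutorService validationExecutor =
                MoreExecutors.listeningDecorator(Executors.newSingleThreadExecutor());

        // Occupy the single worker so the next task stays queued.
        validationExecutor.submit(() -> { TimeUnit.SECONDS.sleep(10); return null; });

        // This "validation" is still sitting in the queue when the pool is stopped.
        ListenableFuture<String> validation = validationExecutor.submit(() -> "done");

        // Simulates the force-terminate path stopping the executor: queued tasks
        // are dropped, but their futures are neither completed nor cancelled.
        validationExecutor.shutdownNow();

        // The "repair thread" blocks here forever, like RepairJob.run() above.
        Futures.getUnchecked(validation);
    }
}
{noformat}
The main thread ends up parked inside the Guava future's {{get()}}, matching 
the WAITING (parking) state shown in the stack trace above.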




[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15902:
-
  Fix Version/s: (was: 3.11.x)
 (was: 3.0.x)
 3.11.9
 3.0.23
 4.0-beta
  Since Version: 3.0.0
Source Control Link: 
https://github.com/apache/cassandra/commit/45ad38fb5aec76418589c07d88fd0ca27fb430f4
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

Great work, committed.  Thanks!
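The committed change is in the commit linked above. As a rough illustration of 
the general pattern only (a hypothetical sketch, not the actual patch), one way 
to avoid this class of hang is to track the outstanding validation futures and 
fail them explicitly when a session is force terminated, so that any thread 
parked in {{Futures.getUnchecked(...)}} is released with an exception instead 
of waiting forever:
{noformat}
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.SettableFuture;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical tracker, not Cassandra code: illustrates releasing blocked
// waiters when a repair session is force terminated.
public class ValidationTracker
{
    private final Queue<SettableFuture<Object>> outstanding = new ConcurrentLinkedQueue<>();

    public ListenableFuture<Object> register()
    {
        SettableFuture<Object> future = SettableFuture.create();
        outstanding.add(future);
        return future;
    }

    // Called from the force-terminate path: fail every pending validation so
    // that waiters wake up with an exception instead of parking forever.
    public void forceTerminate()
    {
        SettableFuture<Object> future;
        while ((future = outstanding.poll()) != null)
            future.setException(new RuntimeException("Repair session was force terminated"));
    }
}
{noformat}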


[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15902:
-
Reviewers: Alexander Dejanovski, Brandon Williams  (was: Alexander 
Dejanovski)


[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Status: Ready to Commit  (was: Review In Progress)


[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Status: Patch Available  (was: Review In Progress)

LGTM


[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-10-28 Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Reviewers: Alexander Dejanovski, Alexander Dejanovski  (was: Alexander 
Dejanovski)
   Alexander Dejanovski, Alexander Dejanovski  (was: Alexander 
Dejanovski)
   Status: Review In Progress  (was: Patch Available)



[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-09-28 Alexander Dejanovski (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-15902:
-
Reviewers: Alexander Dejanovski, Alexander Dejanovski  (was: Alexander 
Dejanovski)
   Alexander Dejanovski, Alexander Dejanovski
   Status: Review In Progress  (was: Patch Available)

Starting testing and review.



[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-09-10 Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Test and Documentation Plan: 
* Add a unit test exposing the issue (see the sketch below)
* For trunk, add only a regression test as a unit test
 Status: Patch Available  (was: Open)
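As a rough idea of what such a regression test could look like (a hypothetical 
sketch written against the ValidationTracker example shown earlier in this 
digest, not the test that was actually committed): force terminate, then assert 
that a waiter on an outstanding validation future fails promptly instead of 
blocking forever.
{noformat}
import static org.junit.Assert.assertThrows;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.junit.Test;

public class ForceTerminateReleasesWaitersTest
{
    @Test
    public void waiterIsReleasedWhenRepairIsForceTerminated()
    {
        // ValidationTracker is the hypothetical helper sketched earlier, not Cassandra code.
        ValidationTracker tracker = new ValidationTracker();
        Future<Object> validation = tracker.register();

        tracker.forceTerminate();

        // The waiter must fail promptly with an exception instead of parking forever.
        assertThrows(ExecutionException.class,
                     () -> validation.get(5, TimeUnit.SECONDS));
    }
}
{noformat}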



[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-09-10 Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-15902:

 Bug Category: Parent values: Degradation(12984)Level 1 values: Resource 
Management(12995)
   Complexity: Normal
Discovered By: User Report
 Severity: Normal
   Status: Open  (was: Triage Needed)



[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-09-03 Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Fix Version/s: 3.0.x




[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-30 Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-15902:
---
Fix Version/s: 3.11.x




[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-30 Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Description: 
In our cluster, after a while some nodes running slowly out of memory. On that 
nodes we observed that Cassandra Reaper terminate repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because reaching 
timeout of 30 min.

In the memory heap dump we see lot of instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
{noformat}
119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 %) 
bytes. {noformat}
In the thread dump we see lot of repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at 
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at 
org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
 Source)
at java.lang.Thread.run(Thread.java:748) {noformat}
 

Thats the line where the threads stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
 

The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the 
thread pool executor. It looks like that futures which are in progress will 
therefor never be completed and the repair thread waits forever and won't be 
finished.

 

Environment:

Cassandra version: 3.11.4 and 3.11.6

Cassandra Reaper: 1.4.0

JVM memory settings:
{noformat}
-Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
-XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
on another cluster with same issue:
{noformat}
-Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
-XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
 

The same issue described in this comment: 
https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

As suggested in the comments I created this new specific ticket.

  was:
In our cluster, after a while some nodes running slowly out of memory. On that 
nodes we observed that Cassandra Reaper terminate repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because reaching 
timeout of 30 min.

In the memory heap dump we see lot of instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
{noformat}
119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 %) 
bytes. {noformat}
In the thread dump we see lot of repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-25 Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Attachment: heap-mem-histo.txt

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Attachments: heap-mem-histo.txt, repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper terminates repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> they reach the timeout of 30 min.
> In the heap dump we see that many instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
> {noformat}
> 119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 
> %) bytes. {noformat}
> In the thread dump we see many repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> This is the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are still in progress 
> will therefore never be completed, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4
> Cassandra Reaper: 1.4.0
> JVM memory settings:
> {noformat}
> -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> on another cluster with the same issue:
> {noformat}
> -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
> -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue is described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments, I created this new, more specific ticket.






[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-25 Thread Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Description: 
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper terminates repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see that many instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
{noformat}
119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 %) 
bytes. {noformat}
In the thread dump we see many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at 
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at 
org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
 Source)
at java.lang.Thread.run(Thread.java:748) {noformat}
 

This is the line where the threads are stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
 

The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the 
thread pool executor. It looks like futures which are still in progress will 
therefore never be completed, so the repair thread waits forever and never 
finishes.

 

Environment:

Cassandra version: 3.11.4

Cassandra Reaper: 1.4.0

JVM memory settings:
{noformat}
-Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
-XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
on another cluster with the same issue:
{noformat}
-Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
-XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
 

The same issue is described in this comment: 
https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

As suggested in the comments, I created this new, more specific ticket.

  was:
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper terminates repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see that many instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
{noformat}
119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 %) 
bytes. {noformat}
In the thread dump we see many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-25 Thread Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Description: 
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper terminates repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see that many instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}} occupy most of the memory:
{noformat}
119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x51a80" occupy 8.445.684.480 (93,96 %) 
bytes. {noformat}
In the thread dump we see many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at 
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at 
org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
 Source)
at java.lang.Thread.run(Thread.java:748) {noformat}
 

This is the line where the threads are stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
 

The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the 
thread pool executor. It looks like futures which are still in progress will 
therefore never be completed, so the repair thread waits forever and never 
finishes.

 

Environment:

Cassandra version: 3.11.4

Cassandra Reaper: 1.4.0

JVM memory settings:
{noformat}
-Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 
-XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M {noformat}
Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
 

The same issue is described in this comment: 
https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

As suggested in the comments, I created this new, more specific ticket.

  was:
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-25 Thread Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Description: 
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at 
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at 
org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
 Source)
at java.lang.Thread.run(Thread.java:748) {noformat}
 

This is the line where the threads are stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
 

The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the 
thread pool executor. It looks like futures which are still in progress will 
therefore never be completed, so the repair thread waits forever and never 
finishes.

 

Environment:

Cassandra version: 3.11.4

Cassandra Reaper: 1.4.0

Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
 

The same issue is described in this comment: 
https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

As suggested in the comments, I created this new, more specific ticket.

  was:
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-25 Thread Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Attachment: repair-terminated.txt

> OOM because repair session thread not closed when terminating repair
> 
>
> Key: CASSANDRA-15902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15902
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair
>Reporter: Swen Fuhrmann
>Assignee: Swen Fuhrmann
>Priority: Normal
> Attachments: repair-terminated.txt
>
>
> In our cluster, after a while some nodes slowly run out of memory. On 
> those nodes we observed that Cassandra Reaper cancels repairs with a JMX 
> call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} because 
> they reach the timeout of 30 min.
> In the heap dump we see >100 instances of 
> {{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
> many repair threads:
> {noformat}
> grep "Repair#" threaddump.txt | wc -l
>   50 {noformat}
>  
> The repair jobs are waiting for the validation to finish:
> {noformat}
> "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 
> nid=0x542a waiting on condition [0x7f81ee414000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007939bcfc8> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
> at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
> at 
> com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
> at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> at 
> org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
>  Source)
> at java.lang.Thread.run(Thread.java:748) {noformat}
>  
> This is the line where the threads are stuck:
> {noformat}
> // Wait for validation to complete
> Futures.getUnchecked(validations); {noformat}
>  
> The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops 
> the thread pool executor. It looks like futures which are still in progress 
> will therefore never be completed, so the repair thread waits forever and 
> never finishes.
>  
> Environment:
> Cassandra version: 3.11.4
> Cassandra Reaper: 1.4.0
> Java Runtime:
> {noformat}
> openjdk version "1.8.0_212"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 
> {noformat}
>  
> The same issue is described in this comment: 
> https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973
> As suggested in the comments, I created this new, more specific ticket.






[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-25 Thread Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Description: 
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at 
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at 
org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
 Source)
at java.lang.Thread.run(Thread.java:748) {noformat}
 

This is the line where the threads are stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
 

The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the 
thread pool executor. It looks like futures which are still in progress will 
therefore never be completed, so the repair thread waits forever and never 
finishes.

 

Environment:

Cassandra version: 3.11.4

Cassandra Reaper: 1.4.0

Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
 

The same issue is described in this comment: 
https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

As suggested in the comments, I created this new, more specific ticket.

  was:
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 

[jira] [Updated] (CASSANDRA-15902) OOM because repair session thread not closed when terminating repair

2020-06-25 Thread Swen Fuhrmann (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swen Fuhrmann updated CASSANDRA-15902:
--
Description: 
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
at 
com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
at 
org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown
 Source)
at java.lang.Thread.run(Thread.java:748) {noformat}
 

This is the line where the threads are stuck:
{noformat}
// Wait for validation to complete
Futures.getUnchecked(validations); {noformat}
 

The call to {{StorageServiceMBean.forceTerminateAllRepairSessions()}} stops the 
thread pool executor. It looks like futures which are still in progress will 
therefore never be completed, so the repair thread waits forever and never 
finishes.

 

Environment:

Cassandra version: 3.11.4

Cassandra Reaper: 1.4.0

Java Runtime:
{noformat}
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) {noformat}
 

The same issue is described here: 
https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

As suggested in the comments, I created this new, more specific ticket.

  was:
In our cluster, after a while some nodes slowly run out of memory. On those 
nodes we observed that Cassandra Reaper cancels repairs with a JMX call to 
{{StorageServiceMBean.forceTerminateAllRepairSessions()}} because they reach 
the timeout of 30 min.

In the heap dump we see >100 instances of 
{{io.netty.util.concurrent.FastThreadLocalThread}}. In the thread dump we see 
many repair threads:
{noformat}
grep "Repair#" threaddump.txt | wc -l
  50 {noformat}
 

The repair jobs are waiting for the validation to finish:
{noformat}
"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x12fc5000 nid=0x542a 
waiting on condition [0x7f81ee414000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007939bcfc8> (a 
com.google.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at