[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-02-04 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759846#comment-16759846
 ] 

Alex Petrov commented on CASSANDRA-14922:
-

I've investigated this one a bit further, but to be honest could not find any 
proof of leaks. I've also tried to make shutdown process more synchronous to 
reduce amount of in-flight memory, but this hasn't helped. I did also check the 
possible native memory leak. You are right there native byte buffer instances 
hanging after instance shutdown, but all of them as far as I can tell were 
pending finalization (see !Screen Shot 2019-01-30 at 15.47.13!).

I've tried running tests in non-constrained environment in loop 
[here|https://circleci.com/gh/ifesdjeen/cassandra/1197], and it seems to be 
passing fine after 10x100 runs. 

I made a small follow-up patch (however, these improvements are minor and do 
not have any impact on the end result) 
[here|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:oom-improvements].
 In this patch I also disable in-jvm dtests for resource constrained 
environment.

If this looks ok, I suggest we do that and merge [~benedict]'s multi-version 
patch and continue writing tests.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> LeakedNativeMemory.png, Leaking_Metrics_On_Shutdown.png, 
> MainClassRetaining.png, MemoryReclaimedFix.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, Screen Shot 
> 2019-01-30 at 15.46.35.png, Screen Shot 2019-01-30 at 15.47.13.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-24 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751659#comment-16751659
 ] 

Benedict commented on CASSANDRA-14922:
--

FWIW, I don't suspect anything.  I have had terminations due to file handle 
exhaustion locally, but I absolutely believe that we are likely to have some 
native memory leaks too.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> LeakedNativeMemory.png, Leaking_Metrics_On_Shutdown.png, 
> MainClassRetaining.png, MemoryReclaimedFix.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-24 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751398#comment-16751398
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

[~ifesdjeen] they are no longer JVM level OOMs; I think they may be OS level 
kills (Benedict suspects open files, I suspect native memory) but I will 
confirm shortly using circleci + ssh.

For example on the latest trunk I see three test fails, the in-jvm dtests and 
two PagingTests: 
[7d138e20|https://circleci.com/gh/jolynch/cassandra/413#tests]. I'm also 
getting these failures in other bug fix branches e.g. 
[CASSANDRA-14096-trunk|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14096-trunk].

My java version running locally is 8u191:
{noformat}
$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
{noformat}
 

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> LeakedNativeMemory.png, Leaking_Metrics_On_Shutdown.png, 
> MainClassRetaining.png, MemoryReclaimedFix.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-24 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751020#comment-16751020
 ] 

Alex Petrov commented on CASSANDRA-14922:
-

[~jolynch] to be honest I do not see OOMs so far. I'm running up to 100 3-node 
cluster instances and it works fine. Which JDK are you using? Could that be a 
difference?

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> LeakedNativeMemory.png, Leaking_Metrics_On_Shutdown.png, 
> MainClassRetaining.png, MemoryReclaimedFix.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-23 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750488#comment-16750488
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

I'm getting trunk failures again as of, and after an [IRC 
discussion|https://wilderness.apache.org/channels/?f=cassandra-dev/2019-01-23] 
with Benedict it looks like we may be leaking:
1. Off heap memory via some combination of the {{HintsBuffer}}, {{CommitLogs}} 
and the {{BufferPool}}
2. File descriptors are potentially leaked and it's unclear that we clean those 
up

What is odd is that according to a profiler attached while running one of the 
dtests in a for loop, most of the leaked native memory is either pending 
finalization or unreachable from GC roots:

 !LeakedNativeMemory.png! 

Afaict both {{HintsBuffer} and {{CommitLogs}} should be getting cleaned in the 
{{Instance::shutdown}} methods, although I don't think we clean the 
{{BufferPool}}.

Continuing to investigate this so that we can have green runs on trunk again.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> LeakedNativeMemory.png, Leaking_Metrics_On_Shutdown.png, 
> MainClassRetaining.png, MemoryReclaimedFix.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-15 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743312#comment-16743312
 ] 

Benedict commented on CASSANDRA-14922:
--

Thanks, committed follow-up as 
[00fff3ee6e6c0142529de621bcaeee5790a0c235|https://github.com/apache/cassandra/commit/00fff3ee6e6c0142529de621bcaeee5790a0c235]

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-15 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743066#comment-16743066
 ] 

Alex Petrov commented on CASSANDRA-14922:
-

+1 for the latest version, huge improvement!

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-14 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742246#comment-16742246
 ] 

Benedict commented on CASSANDRA-14922:
--

I've pushed an update with the suggested changes, and also reintroduced some 
sequencing to the executor shutdowns.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-14 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742047#comment-16742047
 ] 

Benedict commented on CASSANDRA-14922:
--

Thanks for the review.  I'll push an update shortly, with the test failures 
handled, and a slight tweak to not misleadingly return a SerializableX when it 
is wrapped with an executor (since this cannot be serialized).

bq. we can remove thread-local works in this case since we're passing the right 
class loader, making references unreachable. Is that right?

By 'passing' do you mean to the thread factory?  In which case, no, that 
shouldn't have an impact (I pass it only because it seems to make sense, it 
should work still without doing so, and seems to if I try).  All we're really 
doing is ensuring that any thread that evaluates anything inside one of the 
classes loaded by the instance's classloader is shutdown when the node is 
shutdown, by passing the work to a thread on this executor.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-14 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741910#comment-16741910
 ] 

Alex Petrov commented on CASSANDRA-14922:
-

[~benedict] thank you for the patch. It does look like an improvement. 

I think that using node-local executor will work better for scenarios different 
than "fire up cluster, shut down the cluster". 

Just to confirm my understanding is correct: we can remove thread-local works 
in this case since we're passing the right class loader, making references 
unreachable. Is that right? I did test it and tests do confirm that passing 
"root" class loader would cause heap problems again, but maybe you have some 
more insight on that.

+1 to commit it in the scope of this ticket as a follow-up. However, there are 
some test failures in distributed tests, seemingly caused by different 
exception wrapping in the new version, we should fix them before committing.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-11 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740530#comment-16740530
 ] 

Benedict commented on CASSANDRA-14922:
--

I've pushed a branch 
[here|https://github.com/belliottsmith/cassandra/tree/14922-followup] that 
removes the cleanup of thread locals, and removes the cluster-wide executor, 
instead introducing a node-specific executor on which all invocations to that 
node happen.  This might introduce some slight penalty for evaluating tests, as 
we have multiple synchronisation points where control-flow is handed between 
threads, but it guarantees that all state remains isolated without ever 
burdening the user of the API to restrict their own program design.

We can elaborate on this later to make it ergonomic to use the executor, or 
perhaps to skip the executor, in order to provide efficiency where it matters.

I think it makes sense to classify this as part of the work for this ticket, 
and I will then backport it alongside the original patch for this ticket and 
the jvm-dtests, however it's primarily necessary for _future_ reasons:
 * 3.0 and earlier use {{ThreadLocal}}, not {{FastThreadLocal}}, so we need the 
earlier hackier reflection to fix on these
 * The current approach only cleans up the main thread - any other threads may 
themselves retain state indefinitely
 * Most importantly, when we begin stopping and starting nodes, and/or 
upgrading them, the cluster-wide executor service will retain thread local 
state from the stopped nodes, potentially indefinitely, probably leading to a 
steady leak of memory.

It seems easiest to introduce the future-proof approach now and port it to all 
the versions at once.

On a related note, there is CASSANDRA-14974, which simultaneously fixes a bug 
in our stdout capture, _and_ the impact this has to the cleanup of a 
{{TestCluster}} that occurs once this bug is fixed.  If somebody watching this 
ticket fancies reviewing that ticket, it would also be appreciated.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additio

[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-09 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738600#comment-16738600
 ] 

Benedict commented on CASSANDRA-14922:
--

On a related topic, it might be nice to eventually migrate to assigning a 
{{ThreadGroup}} to all our threads, so we can reliably manage them en masse.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-09 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738597#comment-16738597
 ] 

Benedict commented on CASSANDRA-14922:
--

In back porting this to 3.0, I wondered if it was worth opening a brief bit of 
discussion around alternative approaches to the {{ThreadLocal}} issue, for 
which the Netty approach does not anyway work in 3.0 (since we use regular 
{{ThreadLocal}}).

The fix introduced here assumes that we only access any cluster behaviour from 
the {{main}} thread.  An alternative approach would be to isolate all tests to 
a thread owned by the {{TestCluster}} (which we already have available to us, 
and we shutdown on {{close}}).

If this were the standard pattern for implementing any tests, we should not 
have an issue with the {{main}} thread retaining any references inside its 
{{ThreadLocal}}, but we also will create a pattern that extends to any tests 
written to utilise multi-threading, and hence not run against the {{main}} 
thread.

I've implemented this in my backport branch, and it is quite straight forward, 
however I am still chasing down other 3.0-era leaks (right now around our 
custom log4j integrations)

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-09 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738278#comment-16738278
 ] 

Alex Petrov commented on CASSANDRA-14922:
-

Committed to trunk with 
[d5005627b02b4e716947fa05a40473368017c0f9|https://github.com/apache/cassandra/commit/d5005627b02b4e716947fa05a40473368017c0f9].

[CI run|https://circleci.com/workflow-run/eed85c4b-3a55-46bb-bad8-93c4c1e820f9]

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-08 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737726#comment-16737726
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

{quote}
Sure thing. I'll start the rebase tomorrow in that case. In that case, also, 
I've pushed my one nit from a quick look through here for Alex to look at, that 
I would have simply ninja'd in (with comment here, of course). This is just 
using the HintsBuffer.free method instead of directly invoking 
DirectByteBuffer.cleaner().clean().
{quote}
Ah cool, yea that appears to still work (and then we can leave the slab private 
in {{HintsBuffer}} as well.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-08 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737711#comment-16737711
 ] 

Benedict commented on CASSANDRA-14922:
--

bq.  can we wait for Alex to see the latest diff though... I've changed the 
patch a bit since he last looked.

Sure thing.  I'll start the rebase tomorrow in that case.  In that case, also, 
I've pushed my one nit from a quick look through 
[here|https://github.com/belliottsmith/cassandra/tree/14922] for Alex to look 
at, that I would have simply ninja'd in (with comment here, of course).  This 
is just using the {{HintsBuffer.free}} method instead of directly invoking 
{{DirectByteBuffer.cleaner().clean()}}.

bq. Regarding the backport, I am slightly concerned about the NativeLibrary 
changes being backported in their current form.

Thanks for highlighting this.  I'll be sure to take a close look at the 
behaviour on each version we backport to.  I expect there will be other places 
that need similar treatment to what you've done here, as well, so I need to 
double check anyway.

bq. I think the Soft references are coming from 
java.io.ObjectStreamClass$Caches.localDescs, but the object serder we're doing 
in InvokableInstance is a bit beyond my JVM skills I'm afraid.

No worries at all, thanks very much for reproducing this information here for 
posterity.  If we ever want to clean this up, it would probably be easiest to 
simply avoid ser/deser entirely (or use custom ser/deser), but your approach is 
a much more suitable compromise for now.  Thanks again also for all the 
investigative work to plug these gaps.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-08 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737614#comment-16737614
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

[~benedict],

Awesome, can we wait for Alex to see the latest diff though with the reflection 
removed in favor of his proposed fast local thread pool cleanup method? I've 
changed the patch a bit since he last looked.

Regarding the backport, I am slightly concerned about the NativeLibrary changes 
being backported in their current form. From my reading of the JNA source code 
in version 4.2.2 in trunk we're just skipping the cache by using 
[NativeLibrary::getInstance|https://github.com/java-native-access/jna/blob/4bcc6191c5467361b5c1f12fb5797354cc3aa897/src/com/sun/jna/NativeLibrary.java#L341]
 directly and passing it to 
[Native::register(NativeLibrary)|https://github.com/java-native-access/jna/blob/4bcc6191c5467361b5c1f12fb5797354cc3aa897/src/com/sun/jna/Native.java#L1260]
 instead of having 
[Native::register(String)|https://github.com/java-native-access/jna/blob/4bcc6191c5467361b5c1f12fb5797354cc3aa897/src/com/sun/jna/Native.java#L1251]
 do that for us and cache the classloader along the way 
[here|https://github.com/java-native-access/jna/blob/4bcc6191c5467361b5c1f12fb5797354cc3aa897/src/com/sun/jna/NativeLibrary.java#L363].
 But, if I'm wrong it's unlikely we'd know, as while our tests cover Linux 
pretty thoroughly, darwin/windows are less covered.

Also I forgot to respond to your question about SoftReferences here, did it on 
IRC but not here.
{quote}Do you know where the soft references originate? I wonder if there's 
anything we can do to simply eliminate them.
{quote}
I think the Soft references are coming from 
{{java.io.ObjectStreamClass$Caches.localDescs}}, but the object serder we're 
doing in {{InvokableInstance}} is a bit beyond my JVM skills I'm afraid. I 
don't know how we can prevent the object serializations from caching the class 
descriptions... Perhaps the JVM option is sufficient for now and if we don't 
like that going forward we can dive in more?

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-08 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737557#comment-16737557
 ] 

Benedict commented on CASSANDRA-14922:
--

Marking 'Ready to Commit' given [~ifesdjeen]'s comments.  I'll give it another 
quick once over then commit, so I can rebase CASSANDRA-14931 and 
CASSANDRA-14937.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-08 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737552#comment-16737552
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

{quote}The patch looks good, and I'd say [~jolynch] let's merge it,
{quote}
Ok, yea I agree let's merge what we have so that the unit tests can pass on 
trunk again and we can follow up in CASSANDRA-14969. I've put up a patch 
against trunk with what we have so far (including your changes from the demo 
branch).

 
||trunk||
|[d361ba9b|https://github.com/apache/cassandra/commit/d361ba9b846cf6dc9c3ef5daca7aab5a39ec8fcc]|
|[!https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14922.png?circle-token=
 
1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14922]|

 

If I attach a profiler during an intellij "run this test until it fails" mode I 
can see that the memory is indeed getting cleaned up:

!MemoryReclaimedFix.png!

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2019-01-08 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737083#comment-16737083
 ] 

Alex Petrov commented on CASSANDRA-14922:
-

The patch looks good, and I'd say [~jolynch] let's merge it, since tests have 
been failing for a while now, unless there's something else you wanted to 
include in the patch immediately. 

I've had a couple of minor suggestions. All of the issues are easier to see / 
reproducible with a very small heap, ~256Mb: 
  * Hints are leaking direct memory 
  * Threadlocals are leaked 
  * FastThreadLocalThread thread locals are leaked (sorry for a tongue-twister)

I've put together a small 
[demo|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:CASSANDRA-14922]
 just for demonstration purposes if you wanted to see the impact of suggested 
changes.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-13 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720109#comment-16720109
 ] 

Benedict commented on CASSANDRA-14922:
--

Do you know where the soft references originate?  I wonder if there's anything 
we can do to simply eliminate them.

Short of that, I'm comfortable setting the SoftRefLRUPolicyMSPerMB - the 
default seems pretty strange (of 1s per MB), and probably does not seem to play 
well with Metaspace, since this probably doesn't contribute to the input to the 
SoftRefLRUPolicyMSPerMB.  So we could have Gigabytes of free space, and judge 
that we need the referent to have been unused for hours before we collect it.

It's been a while since I sperlunked in the JDK source, but it might be worth 
taking a look to find out exactly what number if provides, but honestly I don't 
really see a problem with setting the value to zero.  The worst impact of soft 
references being freed too-eagrerly is performance, it should never be 
functional.  We don't depend on them directly in C*, so it would only be via 
the JDK.  Worst case scenario, we need to investigate again at some later date 
an issue with the tests.

Nice catch on the Native API also.

Overall this looks like excellent work (Though I'll leave a full review to 
Alex, since he's marked himself reviewer)

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-12 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719715#comment-16719715
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

I found a better workaround (in 
[bfd8c328|https://github.com/apache/cassandra/commit/bfd8c328f92e6bd2c82269a048f6b868179f484a])
 for the jna {{Native}} Classloader leak is to just call the 
[register|https://github.com/java-native-access/jna/blob/365f6b343a92427e7890cae0c16df7b6c4c254d4/src/com/sun/jna/Native.java#L1446]
 API that doesn't add the {{InstanceClassLoader}} to a global static map in 
{{Native}}. This still loads the libraries and doesn't leak the ClassLoader 
into a global static map. If we're concerned about the changes to the 
{{NativeLibrary}} I can pursue other option.

I also believe I know why the JVM is still not releasing the classloaders even 
after they are no longer referenced: 
[SoftReferences|https://docs.oracle.com/javase/8/docs/api/java/lang/ref/SoftReference.html]
 which are apparently just ... not collected. It looks like Metaspace just 
grows without bound until we run the machine OOM (not the JVM, the machine).

I tried the following after eliminating all strong references:
 1. No JVM tuning. This _did not_ work as the metaspace would grow without 
bound and eventually OOM the machine (not the JVM, the machine).
 2. {{-XX:SoftRefLRUPolicyMSPerMB=0}} and or {{-XX:MaxMetaspaceSize=256M}}. 
This *worked* and caused the ClassLoaders to be released. With MaxMetaspaceSize 
we also successfully limit the neww
 3. {{-XX:MaxMetaspaceSize=256M}} and or {{-XX:MetaspaceSize=100M}}. This _did 
not_ work. Setting the Max size would cause an {{java.lang.OutOfMemoryError: 
Metaspace}}, setting just the size didn't do anything (machine would just OOM 
itself anyways)
 4. {{-XX:UseConcMarkSweepGC}} and or {{-XX:CMSClassUnloadingEnabled}}. This 
_did not_ work. Again the metaspace just grew without bound and ran out of 
memory.

I think that running all unit tests under #2 is a pretty reasonable thing to do 
(we shouldn't be leaking Metaspace imo). I've pursued that option in 
[86f982b1|https://github.com/apache/cassandra/commit/86f982b19ba23fc5efcd7421449af4a84d93711b]
 and it appears to work. I also found another thread that leaks only during ant 
test runs which is fixed in 
[512d15b5f|https://github.com/apache/cassandra/commit/512d15b5f08bda1b0dc8a952ce950b5c4341f992].

[~ifesdjeen]/[~benedict] Do you guys think it's acceptable to run all unit 
tests with the off-heap Metaspace limited to 256 megabytes and 
{{SoftRefLRUPolicyMSPerMB}} set to 0 to tell the GC to actually collect soft 
references? It's different than the status quo but I think we probably 
shouldn't leak metaspace (aka permgen from java7/6) and we probably shouldn't 
rely on soft references for things to work generally... I checked and both 
options are available in openjdk8+, but I'm not sure if there is a better way.

Other than the ThreadLocal reflection hacks I think the branch is almost ready 
to squash and merge, what do you think?

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
>

[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-12 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718972#comment-16718972
 ] 

Benedict commented on CASSANDRA-14922:
--

[~jolynch]: nice work, somehow I hadn't spotted this thread until now.

Unfortunately I don't have anything useful to add, except a shot-in-the-dark 
that maybe a full GC is needed?  This is apparently the case for G1GC; not sure 
about other GCs, or which you are using.  It's probably not helpful, but you 
could also look at {{-XX:+TraceClassUnloading}}

Mostly just wanted to say thanks for taking the time to track all of this down.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-11 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718514#comment-16718514
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

I think I figured it out, it turns out that Native holds a static map that 
contains the custom classloaders called {{options}}, which is what the last two 
references were talking about. I tried cleanly unloading or unregistering those 
but that still doesn't remove the reference to the ClassLoader in the 
{{options}} map so I just hacked some reflection based solution together 
(similar to what I did for the {{ThreadLocal}} variables) in  
[6573176|https://github.com/jolynch/cassandra/commit/6573176ef3ed9601ce7f02602c964d478c6a5741].
 According to Yourkit there are now no more strong references to the 
{{InstanceClassLoader}} instances (attached).

We leak a lot less memory with my latest patches, but ... for some reason the 
class loaders aren't going away and are still retaining some heap, just a lot 
let ... Still something missing, maybe something like 
{{CMSClassUnloadingEnabled}} or some such?

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> OnlyThreeRootsLeft.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-11 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718243#comment-16718243
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

Nailed the MessageSink leak and properly shuts down logback now, and we clean 
up ThreadLocal variables which afaict means that there are no additional 
threads that could hold references ... and at least now [my 
branch|https://github.com/jolynch/cassandra/tree/CASSANDRA-14922] can pass unit 
tests in 
[circleci|https://circleci.com/workflow-run/6fb24842-bbb8-4aac-b137-4007729bf39a]
 so that's good.

Unfortunately I think we're still leaking, according to my heap dump analysis 
the _main_ thread ends up with a strong reference to the 
{{InstanceClassLoaders}}, which is really odd since the InstanceClassLoader 
doesn't have any static state so I don't see how those don't go out of scope 
when the test methods finish.

[~ifesdjeen] tbh at this point I'm pretty stuck. My understanding is that the 
{{InstanceClassLoaders}} should go out of scope after each test method, which 
should allow everything to GC now that there are no more live references from 
threads to the {{InstanceClassLoaders}}, but that's not happening. I think it 
might be related to passing in the current threads classloader 
[here|https://github.com/jolynch/cassandra/blob/1ca6ad3c41f4456b674e13883e0df0091f638564/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L241]
 but I'm not sure how to achieve what we need there without that.

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> OnlyThreeRootsLeft.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-07 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713487#comment-16713487
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

Alright, I think I'm narrowing in on this. I've managed to get all the threads 
to die over in [a 
branch|https://github.com/jolynch/cassandra/tree/CASSANDRA-14922] but we're 
still leaking all the static state through the {{InstanceClassLoader}} s. I 
think I've narrowed it down to just three remaining references (and I _think_ 
only one of them is a strong reference), details attached in the screenshots.

We basically just need to kill that last strong reference and I believe that 
the whole {{InstanceClassLoader}} should become collectible at that point (even 
with all the static state and self references to the classloaders should be ok 
since it'll be cut off at the root, I think).

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, OnlyThreeRootsLeft.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-04 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709389#comment-16709389
 ] 

Joseph Lynch commented on CASSANDRA-14922:
--

Definitely have to keep the shutdown fast, I think we just have a deadlock or 
something like that in the cleanups where we're not actually cleaning up. I 
tried adding in {{Keyspace::clear}} calls in the shutdown hook but that 
deadlocked on the single STPE in {{nonPeriodicTasks}}, still trying to figure 
out why we can't drop all the CFs and release all the metrics but making 
progress :)

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: Leaking_Metrics_On_Shutdown.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

2018-12-04 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709185#comment-16709185
 ] 

Alex Petrov commented on CASSANDRA-14922:
-

{{close}} is not waiting for full instance shutdown (same as instance shutdown 
does not wait for executor shutdown) wasn't an oversight. It makes shutdown 
significantly faster and allows shutting down a cluster much quicker. 

I do agree we should fix leaks and make sure things shutdown properly and we 
definitely have to do what we can to make it happen. But I also think we should 
keep it performant. If there's no way and clusters churn quicker than we can 
shutdown them, we can make a switch. 

Another thing we might want to consider is reusing cluster instance for 
multiple tests as I think this will most frequently be the case. We can make it 
in a follow-up ticket, since it is, however helpful, also orthogonal. 

> In JVM dtests need to clean up after instance shutdown
> --
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Testing
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Minor
> Attachments: Leaking_Metrics_On_Shutdown.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac29120c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org