[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719715#comment-16719715
 ] 

Joseph Lynch edited comment on CASSANDRA-14922 at 12/13/18 3:51 AM:
--------------------------------------------------------------------

I found a better workaround (in 
[bfd8c328|https://github.com/apache/cassandra/commit/bfd8c328f92e6bd2c82269a048f6b868179f484a])
 for the jna {{Native}} Classloader leak is to just call the 
[register|https://github.com/java-native-access/jna/blob/365f6b343a92427e7890cae0c16df7b6c4c254d4/src/com/sun/jna/Native.java#L1446]
 API that doesn't add the {{InstanceClassLoader}} to a global static map in 
{{Native}}. This still loads the libraries and doesn't leak the ClassLoader 
into a global static map. If we're concerned about the changes to the 
{{NativeLibrary}} I can pursue other option. I also found another thread that 
leaks only during ant test runs which is fixed in 
[512d15b5f|https://github.com/apache/cassandra/commit/512d15b5f08bda1b0dc8a952ce950b5c4341f992].

Furthermore I believe I know why the JVM was still not releasing the 
classloaders even after they are no longer referenced: 
[SoftReferences|https://docs.oracle.com/javase/8/docs/api/java/lang/ref/SoftReference.html]
 are apparently just ... not collected. It looks like Metaspace just grows 
without bound until we run the machine OOM (not the JVM, the machine).

I tried the following after eliminating all strong references:
 1. No JVM tuning. This _did not_ work as the metaspace would grow without 
bound and eventually OOM the machine (not the JVM, the machine).
 2. {{-XX:SoftRefLRUPolicyMSPerMB=0}} and or {{-XX:MaxMetaspaceSize=256M}}. 
This *worked* and caused the ClassLoaders to be released. With MaxMetaspaceSize 
we also successfully limit the offheap Metaspace which is nice.
 3. {{-XX:MaxMetaspaceSize=256M}} and or {{-XX:MetaspaceSize=100M}}. This _did 
not_ work. Setting the Max size would cause an {{java.lang.OutOfMemoryError: 
Metaspace}}, setting just the size didn't do anything (machine would just OOM 
itself anyways)
 4. {{-XX:UseConcMarkSweepGC}} and or {{-XX:CMSClassUnloadingEnabled}}. This 
_did not_ work. Again the metaspace just grew without bound and ran out of 
memory.

I think that running all unit tests under #2 is a pretty reasonable thing to do 
(we shouldn't be leaking Metaspace imo). I've pursued that option in 
[86f982b1|https://github.com/apache/cassandra/commit/86f982b19ba23fc5efcd7421449af4a84d93711b]
 and it appears to work looking at my 
[profiler|https://issues.apache.org/jira/secure/attachment/12951598/Metaspace_Actually_Collected.png].

[~ifesdjeen]/[~benedict] Do you guys think it's acceptable to run all unit 
tests with the off-heap Metaspace limited to 256 megabytes and 
{{SoftRefLRUPolicyMSPerMB}} set to 0 to tell the GC to actually collect soft 
references? It's different than the status quo but I think we probably 
shouldn't leak metaspace (aka permgen from java7/6) and we probably shouldn't 
rely on soft references for things to work generally... I checked and both 
options are available in openjdk8+, but I'm not sure if there is a better way.

Other than the ThreadLocal reflection hacks I think the branch is almost ready 
to squash and merge, what do you think?


was (Author: jolynch):
I found a better workaround (in 
[bfd8c328|https://github.com/apache/cassandra/commit/bfd8c328f92e6bd2c82269a048f6b868179f484a])
 for the jna {{Native}} Classloader leak is to just call the 
[register|https://github.com/java-native-access/jna/blob/365f6b343a92427e7890cae0c16df7b6c4c254d4/src/com/sun/jna/Native.java#L1446]
 API that doesn't add the {{InstanceClassLoader}} to a global static map in 
{{Native}}. This still loads the libraries and doesn't leak the ClassLoader 
into a global static map. If we're concerned about the changes to the 
{{NativeLibrary}} I can pursue other option. I also found another thread that 
leaks only during ant test runs which is fixed in 
[512d15b5f|https://github.com/apache/cassandra/commit/512d15b5f08bda1b0dc8a952ce950b5c4341f992].

Furthermore I believe I know why the JVM was still not releasing the 
classloaders even after they are no longer referenced: 
[SoftReferences|https://docs.oracle.com/javase/8/docs/api/java/lang/ref/SoftReference.html]
 are apparently just ... not collected. It looks like Metaspace just grows 
without bound until we run the machine OOM (not the JVM, the machine).

I tried the following after eliminating all strong references:
 1. No JVM tuning. This _did not_ work as the metaspace would grow without 
bound and eventually OOM the machine (not the JVM, the machine).
 2. {{-XX:SoftRefLRUPolicyMSPerMB=0}} and or {{-XX:MaxMetaspaceSize=256M}}. 
This *worked* and caused the ClassLoaders to be released. With MaxMetaspaceSize 
we also successfully limit the neww
 3. {{-XX:MaxMetaspaceSize=256M}} and or {{-XX:MetaspaceSize=100M}}. This _did 
not_ work. Setting the Max size would cause an {{java.lang.OutOfMemoryError: 
Metaspace}}, setting just the size didn't do anything (machine would just OOM 
itself anyways)
 4. {{-XX:UseConcMarkSweepGC}} and or {{-XX:CMSClassUnloadingEnabled}}. This 
_did not_ work. Again the metaspace just grew without bound and ran out of 
memory.

I think that running all unit tests under #2 is a pretty reasonable thing to do 
(we shouldn't be leaking Metaspace imo). I've pursued that option in 
[86f982b1|https://github.com/apache/cassandra/commit/86f982b19ba23fc5efcd7421449af4a84d93711b]
 and it appears to work looking at my 
[profiler|https://issues.apache.org/jira/secure/attachment/12951598/Metaspace_Actually_Collected.png].

[~ifesdjeen]/[~benedict] Do you guys think it's acceptable to run all unit 
tests with the off-heap Metaspace limited to 256 megabytes and 
{{SoftRefLRUPolicyMSPerMB}} set to 0 to tell the GC to actually collect soft 
references? It's different than the status quo but I think we probably 
shouldn't leak metaspace (aka permgen from java7/6) and we probably shouldn't 
rely on soft references for things to work generally... I checked and both 
options are available in openjdk8+, but I'm not sure if there is a better way.

Other than the ThreadLocal reflection hacks I think the branch is almost ready 
to squash and merge, what do you think?

> In JVM dtests need to clean up after instance shutdown
> ------------------------------------------------------
>
>                 Key: CASSANDRA-14922
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Testing
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Minor
>         Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to