[jira] [Comment Edited] (CASSANDRA-14922) In JVM dtests need to clean up after instance shutdown

Joseph Lynch (JIRA) Wed, 23 Jan 2019 15:30:49 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750488#comment-16750488
 ]


Joseph Lynch edited comment on CASSANDRA-14922 at 1/23/19 11:29 PM:
--------------------------------------------------------------------

I'm getting trunk failures again as of e871903d, and after an [IRC 
discussion|https://wilderness.apache.org/channels/?f=cassandra-dev/2019-01-23] 
with Benedict it looks like we may be leaking:
 1. Off-heap memory, via some combination of the {{HintsBuffer}}, 
{{CommitLogs}} and the {{BufferPool}}
 2. File descriptors; it's unclear whether we clean those up

What is odd is that according to a profiler attached while running one of the 
dtests in a for loop, most of the leaked native memory is either pending 
finalization or unreachable from GC roots:

!LeakedNativeMemory.png!

As far as I can tell, both {{HintsBuffer}} and {{CommitLogs}} should be getting 
cleaned up in the {{Instance::shutdown}} methods, although I don't think we 
clean the {{BufferPool}}.
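
A minimal sketch of how both suspected leaks could be watched from inside the test JVM, assuming the leaked off-heap memory shows up in the JDK's "direct" or "mapped" buffer pools (which may not cover every native allocation). The class {{ResourceSnapshot}} is hypothetical and not part of Cassandra; the MXBeans it reads are standard JDK ones.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Cassandra: captures off-heap buffer usage
// and the open file descriptor count so two points in a test run can be compared.
public final class ResourceSnapshot
{
    public final long bufferBytes; // bytes held by the JDK buffer pools ("direct", "mapped")
    public final long openFds;     // open file descriptors, or -1 if the platform does not expose them

    private ResourceSnapshot(long bufferBytes, long openFds)
    {
        this.bufferBytes = bufferBytes;
        this.openFds = openFds;
    }

    public static ResourceSnapshot take()
    {
        long bytes = 0;
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
            bytes += Math.max(0, pool.getMemoryUsed()); // getMemoryUsed() is -1 when unknown

        long fds = -1;
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean)
            fds = ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();

        return new ResourceSnapshot(bytes, fds);
    }

    public static void main(String[] args)
    {
        ResourceSnapshot before = take();
        // Stand-in for a buffer that is still referenced after shutdown.
        ByteBuffer held = ByteBuffer.allocateDirect(1 << 20);
        ResourceSnapshot after = take();
        System.out.printf("buffer bytes: %+d, open fds: %+d (holding %d bytes)%n",
                          after.bufferBytes - before.bufferBytes,
                          after.openFds - before.openFds,
                          held.capacity());
    }
}
```

Taking one snapshot before a cluster is created and one after it is closed, on every loop iteration, would show whether either number keeps growing. Memory that is only pending finalization will still be counted until the collector actually releases it.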

Continuing to investigate this so that we can have green runs on trunk again.

*Edit:*

The test I'm running is applying this diff:
{noformat}
--- 
a/test/distributed/org/apache/cassandra/distributed/DistributedReadWritePathTest.java
+++ 
b/test/distributed/org/apache/cassandra/distributed/DistributedReadWritePathTest.java
@@ -27,6 +27,15 @@ import static 
org.apache.cassandra.net.MessagingService.Verb.READ_REPAIR;
 
 public class DistributedReadWritePathTest extends DistributedTestBase
 {
+    @Test
+    public void manyCoordinatedReads() throws Throwable
+    {
+        for (int i = 0; i < 20; i ++)
+        {
+            coordinatorRead();
+        }
+    }
+
     @Test
     public void coordinatorRead() throws Throwable
     {
{noformat}
Then I run the test suite and measure OS memory usage. For example, if I rewind 
trunk to the first patch (3dcde082), I see only 1.6GB allocated:
{noformat}
/usr/bin/time -f "mem=%K RSS=%M elapsed=%E cpu.sys=%S .user=%U" ant 
testclasslist -Dtest.classlistfile=/tmp/java_dtests_1_final.txt 
-Dtest.classlistprefix=distributed

testclasslist:
     [echo] Number of test runners: 1
[junit-timeout] Testsuite: 
org.apache.cassandra.distributed.DistributedReadWritePathTest
[junit-timeout] Testsuite: 
org.apache.cassandra.distributed.DistributedReadWritePathTest Tests run: 7, 
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 103.473 sec
[junit-timeout] 
[junitreport] Processing 
/home/josephl/pg/cassandra/build/test/TESTS-TestSuites.xml to /tmp/null554485593
[junitreport] Loading stylesheet 
jar:file:/usr/share/ant/lib/ant-junit.jar!/org/apache/tools/ant/taskdefs/optional/junit/xsl/junit-frames.xsl
[junitreport] Transform time: 269ms
[junitreport] Deleting: /tmp/null554485593

BUILD SUCCESSFUL
Total time: 1 minute 47 seconds
mem=0 RSS=1606332 elapsed=1:47.37 cpu.sys=10.95 .user=159.99
{noformat}
But if I use the latest trunk (e871903d), I get 5GB:
{noformat}
testclasslist:
     [echo] Number of test runners: 1
    [mkdir] Created dir: /home/josephl/pg/cassandra/build/test/cassandra
    [mkdir] Created dir: /home/josephl/pg/cassandra/build/test/output
[junit-timeout] Testsuite: 
org.apache.cassandra.distributed.DistributedReadWritePathTest
[junit-timeout] Testsuite: 
org.apache.cassandra.distributed.DistributedReadWritePathTest Tests run: 7, 
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 103.098 sec
[junit-timeout] 
[junitreport] Processing 
/home/josephl/pg/cassandra/build/test/TESTS-TestSuites.xml to /tmp/null126756458
[junitreport] Loading stylesheet 
jar:file:/usr/share/ant/lib/ant-junit.jar!/org/apache/tools/ant/taskdefs/optional/junit/xsl/junit-frames.xsl
[junitreport] Transform time: 330ms
[junitreport] Deleting: /tmp/null126756458

BUILD SUCCESSFUL
Total time: 2 minutes 15 seconds
mem=0 RSS=4962924 elapsed=2:15.93 cpu.sys=16.55 .user=284.28
{noformat}
Since the heap is 1GB, and we allocate about 256MB for the off-heap metaspace, 
1.6GB is much closer to what we expect than 5GB.
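
The comparison can be worked through explicitly. This is only a sketch: it assumes the "1GB" and "256MB" figures are binary units and relies on GNU time reporting {{%M}} (maximum RSS) in kilobytes. The class name {{RssBudget}} is made up for illustration.

```java
// Illustrative arithmetic only: compares the two measured RSS values above
// against the expected footprint of heap plus metaspace.
public final class RssBudget
{
    // RSS values copied from the two runs; GNU time reports %M in kilobytes.
    public static final long RSS_FIRST_PATCH_KB = 1_606_332L; // 3dcde082
    public static final long RSS_TRUNK_KB = 4_962_924L;       // e871903d

    // Expected footprint: 1GB heap plus about 256MB metaspace, in kilobytes.
    public static final long EXPECTED_KB = (1024L + 256L) * 1024L;

    // How far a measured RSS exceeds the expected footprint, in kilobytes.
    public static long excessKb(long rssKb)
    {
        return rssKb - EXPECTED_KB;
    }

    public static void main(String[] args)
    {
        System.out.printf("expected footprint: %d KB%n", EXPECTED_KB);
        System.out.printf("3dcde082 excess:    %d KB%n", excessKb(RSS_FIRST_PATCH_KB));
        System.out.printf("e871903d excess:    %d KB%n", excessKb(RSS_TRUNK_KB));
    }
}
```

Under those assumptions the first patch sits 295612 KB (roughly 290MB) above the expected footprint, while latest trunk sits 3652204 KB (roughly 3.5GB) above it.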

So... something about the new executor system may be contributing. Continuing 
to dig in.


> In JVM dtests need to clean up after instance shutdown
> ------------------------------------------------------
>
>                 Key: CASSANDRA-14922
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> LeakedNativeMemory.png, Leaking_Metrics_On_Shutdown.png, 
> MainClassRetaining.png, MemoryReclaimedFix.png, 
> Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, 
> no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are currently leaking a few hundred megabytes of memory per 
> test. This is not a big deal yet: dtest runs on the larger containers and 
> local testing both continue to work, because the number of in JVM dtests is 
> not yet high enough to cause a problem with more than 2GB of available heap. 
> However, we should fix the memory leak so that going forward we can add more 
> in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
> 2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and ensures that we do 
> not leak memory.
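
The second point in the description (close returning before every instance has finished shutting down) can be sketched as follows. This is not the actual {{TestCluster}} or {{Instance}} code: {{Node}}, {{closeAll}} and the timeout handling are hypothetical names chosen for illustration, and only standard {{java.util.concurrent}} classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public final class AwaitShutdownSketch
{
    // Hypothetical stand-in for a test instance; not the real Instance class.
    public interface Node
    {
        void shutdown();
    }

    // Starts every shutdown in parallel and returns only once all of them have
    // completed, or fails if they do not finish within the timeout.
    public static void closeAll(List<? extends Node> nodes, long timeout, TimeUnit unit)
    {
        ExecutorService executor = Executors.newCachedThreadPool();
        try
        {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (Node node : nodes)
                pending.add(CompletableFuture.runAsync(node::shutdown, executor));

            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).get(timeout, unit);
        }
        catch (InterruptedException | ExecutionException | TimeoutException e)
        {
            throw new IllegalStateException("instances did not shut down cleanly", e);
        }
        finally
        {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args)
    {
        AtomicInteger stopped = new AtomicInteger();
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < 3; i++)
            nodes.add(stopped::incrementAndGet);

        closeAll(nodes, 10, TimeUnit.SECONDS);
        System.out.println("instances stopped: " + stopped.get());
    }
}
```

The point of blocking on the combined future is that anything checked after close (threads, metrics, buffers) is measured only once every instance has had the chance to clean up.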



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

