[ https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750488#comment-16750488 ]
Joseph Lynch edited comment on CASSANDRA-14922 at 1/23/19 11:29 PM:
--------------------------------------------------------------------

I'm getting trunk failures again as of e871903d, and after an [IRC discussion|https://wilderness.apache.org/channels/?f=cassandra-dev/2019-01-23] with Benedict it looks like we may be leaking:
1. Off-heap memory via some combination of the {{HintsBuffer}}, {{CommitLogs}} and the {{BufferPool}}
2. File descriptors, which are potentially leaked; it's unclear whether we clean those up

What is odd is that according to a profiler attached while running one of the dtests in a for loop, most of the leaked native memory is either pending finalization or unreachable from GC roots:

!LeakedNativeMemory.png!

Afaict both {{HintsBuffer}} and {{CommitLogs}} should be getting cleaned in the {{Instance::shutdown}} methods, although I don't think we clean the {{BufferPool}}. Continuing to investigate this so that we can have green runs on trunk again.

*Edit:* The test I'm running applies this diff:
{noformat}
--- a/test/distributed/org/apache/cassandra/distributed/DistributedReadWritePathTest.java
+++ b/test/distributed/org/apache/cassandra/distributed/DistributedReadWritePathTest.java
@@ -27,6 +27,15 @@ import static org.apache.cassandra.net.MessagingService.Verb.READ_REPAIR;
 public class DistributedReadWritePathTest extends DistributedTestBase
 {
+    @Test
+    public void manyCoordinatedReads() throws Throwable
+    {
+        for (int i = 0; i < 20; i ++)
+        {
+            coordinatorRead();
+        }
+    }
+
     @Test
     public void coordinatorRead() throws Throwable
     {
{noformat}

Then I run the test suite and measure OS memory usage. For example, if I rewind trunk to the first patch (3dcde082) I see only 1.6GB allocated:
{noformat}
/usr/bin/time -f "mem=%K RSS=%M elapsed=%E cpu.sys=%S .user=%U" ant testclasslist -Dtest.classlistfile=/tmp/java_dtests_1_final.txt -Dtest.classlistprefix=distributed

testclasslist:
[echo] Number of test runners: 1
[junit-timeout] Testsuite: org.apache.cassandra.distributed.DistributedReadWritePathTest
[junit-timeout] Testsuite: org.apache.cassandra.distributed.DistributedReadWritePathTest Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 103.473 sec
[junit-timeout]
[junitreport] Processing /home/josephl/pg/cassandra/build/test/TESTS-TestSuites.xml to /tmp/null554485593
[junitreport] Loading stylesheet jar:file:/usr/share/ant/lib/ant-junit.jar!/org/apache/tools/ant/taskdefs/optional/junit/xsl/junit-frames.xsl
[junitreport] Transform time: 269ms
[junitreport] Deleting: /tmp/null554485593

BUILD SUCCESSFUL
Total time: 1 minute 47 seconds
mem=0 RSS=1606332 elapsed=1:47.37 cpu.sys=10.95 .user=159.99
{noformat}

But if I use latest trunk (e871903d), I get 5GB:
{noformat}
testclasslist:
[echo] Number of test runners: 1
[mkdir] Created dir: /home/josephl/pg/cassandra/build/test/cassandra
[mkdir] Created dir: /home/josephl/pg/cassandra/build/test/output
[junit-timeout] Testsuite: org.apache.cassandra.distributed.DistributedReadWritePathTest
[junit-timeout] Testsuite: org.apache.cassandra.distributed.DistributedReadWritePathTest Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 103.098 sec
[junit-timeout]
[junitreport] Processing /home/josephl/pg/cassandra/build/test/TESTS-TestSuites.xml to /tmp/null126756458
[junitreport] Loading stylesheet jar:file:/usr/share/ant/lib/ant-junit.jar!/org/apache/tools/ant/taskdefs/optional/junit/xsl/junit-frames.xsl
[junitreport] Transform time: 330ms
[junitreport] Deleting: /tmp/null126756458

BUILD SUCCESSFUL
Total time: 2 minutes 15 seconds
mem=0 RSS=4962924 elapsed=2:15.93 cpu.sys=16.55 .user=284.28
{noformat}

Since the heap is 1GB, and we allocate about 256MB for the off-heap metaspace, 1.6GB is much closer to what we expect than 5GB. So something about the new executor system may be contributing. Continuing to dig in.
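The RSS figures above come from the OS, so they do not say which part of the process grew. One way to narrow that down from inside the JVM is to print heap, non-heap and NIO buffer pool usage between test iterations. The following is a minimal sketch using the standard {{java.lang.management}} beans; the class name {{OffHeapReport}} and its methods are hypothetical and are not part of the Cassandra test code.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Cassandra code base.
public class OffHeapReport
{
    // Sums the memory the JVM attributes to its NIO buffer pools (e.g. "direct" and "mapped").
    static long bufferPoolBytes()
    {
        long total = 0;
        List<BufferPoolMXBean> pools = ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools)
        {
            long used = pool.getMemoryUsed(); // -1 when the pool cannot report usage
            if (used > 0)
                total += used;
        }
        return total;
    }

    // One-line summary in KB, loosely mirroring the /usr/bin/time output format.
    static String report()
    {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long heap = memory.getHeapMemoryUsage().getUsed();
        long nonHeap = memory.getNonHeapMemoryUsage().getUsed();
        return String.format("heap=%dKB nonheap=%dKB nio_buffers=%dKB",
                             heap / 1024, nonHeap / 1024, bufferPoolBytes() / 1024);
    }

    public static void main(String[] args)
    {
        // Allocate 1MiB off heap so the "direct" pool has something to show.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        System.out.println(report());
        // Touch the buffer after reporting so it stays reachable during the measurement.
        System.out.println("direct buffer capacity=" + buffer.capacity());
    }
}
```

Calling {{report()}} after each loop iteration would show whether the growth is visible to these beans at all. Memory allocated by native libraries or through other native allocation paths is not counted here, so a gap between these numbers and the RSS would itself be a hint about where to look.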
> In JVM dtests need to clean up after instance shutdown
> ------------------------------------------------------
>
> Key: CASSANDRA-14922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Minor
> Fix For: 4.0
>
> Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, LeakedNativeMemory.png, Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], [example two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) because we use a small container (medium) for unit tests by default and the in JVM dtests are leaking a few hundred megabytes of memory per test right now.
> This is not a big deal because the dtest runs with the larger containers continue to function fine, as does local testing, since the number of in JVM dtests is not yet high enough to cause a problem with more than 2GB of available heap. However, we should fix the memory leak so that going forward we can add more in JVM dtests without worry.
>
> I've been working with [~ifesdjeen] to debug, and the issue appears to be unreleased Table/Keyspace metrics (screenshot showing the leak attached). I believe that we have a few potential issues that are leading to the leaks:
> 1. The [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354] method is not successfully cleaning up all the metrics created by the {{CassandraMetricsRegistry}}
> 2. The [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283] method is not waiting for all the instances to finish shutting down and cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951] does not release all the metric references (which could leak them)
>
> I am working on a patch which shuts down everything and ensures that we do not leak memory.
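The second point in the issue description, a close method that returns before every instance has finished shutting down, can be illustrated with a small sketch. This is not the actual patch: {{AwaitShutdownSketch}}, the {{Node}} interface and {{closeAll}} are hypothetical names, and the only assumption is that each instance exposes a blocking shutdown call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; the real TestCluster and Instance classes are not used here.
public class AwaitShutdownSketch
{
    // Stand-in for an instance whose shutdown() blocks until its resources are released.
    interface Node
    {
        void shutdown();
    }

    // Starts every shutdown in parallel, then blocks until all have completed or the timeout expires.
    static void closeAll(List<Node> nodes, long timeout, TimeUnit unit)
    {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (Node node : nodes)
            pending.add(CompletableFuture.runAsync(node::shutdown));

        try
        {
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).get(timeout, unit);
        }
        catch (InterruptedException | ExecutionException | TimeoutException e)
        {
            // Fail loudly instead of letting the next test start next to a half-stopped instance.
            throw new RuntimeException("instances did not shut down cleanly", e);
        }
    }

    public static void main(String[] args)
    {
        AtomicInteger stopped = new AtomicInteger();
        Node node = stopped::incrementAndGet;
        closeAll(Arrays.asList(node, node, node), 10, TimeUnit.SECONDS);
        System.out.println("stopped=" + stopped.get());
    }
}
```

Blocking with a timeout means a hung shutdown surfaces as a test failure rather than as memory that quietly accumulates across tests.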