[ 
https://issues.apache.org/jira/browse/CASSANDRA-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740530#comment-16740530
 ] 

Benedict commented on CASSANDRA-14922:
--------------------------------------

I've pushed a branch 
[here|https://github.com/belliottsmith/cassandra/tree/14922-followup] that 
removes the cleanup of thread locals, and removes the cluster-wide executor, 
instead introducing a node-specific executor on which all invocations to that 
node happen.  This might introduce some slight penalty for evaluating tests, as 
we have multiple synchronisation points where control-flow is handed between 
threads, but it guarantees that all state remains isolated without ever 
burdening the user of the API to restrict their own program design.

We can elaborate on this later to make it ergonomic to use the executor, or 
perhaps to skip the executor, in order to provide efficiency where it matters.

I think it makes sense to classify this as part of the work for this ticket, 
and I will then backport it alongside the original patch for this ticket and 
the jvm-dtests, however it's primarily necessary for _future_ reasons:
 * 3.0 and earlier use {{ThreadLocal}}, not {{FastThreadLocal}}, so we need the 
earlier hackier reflection to fix on these
 * The current approach only cleans up the main thread - any other threads may 
themselves retain state indefinitely
 * Most importantly, when we begin stopping and starting nodes, and/or 
upgrading them, the cluster-wide executor service will retain thread local 
state from the stopped nodes, potentially indefinitely, probably leading to a 
steady leak of memory.

It seems easiest to introduce the future-proof approach now and port it to all 
the versions at once.

On a related note, there is CASSANDRA-14974, which simultaneously fixes a bug 
in our stdout capture, _and_ the impact this has to the cleanup of a 
{{TestCluster}} that occurs once this bug is fixed.  If somebody watching this 
ticket fancies reviewing that ticket, it would also be appreciated.

> In JVM dtests need to clean up after instance shutdown
> ------------------------------------------------------
>
>                 Key: CASSANDRA-14922
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14922
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: AllThreadsStopped.png, ClassLoadersRetaining.png, 
> Leaking_Metrics_On_Shutdown.png, MainClassRetaining.png, 
> MemoryReclaimedFix.png, Metaspace_Actually_Collected.png, 
> OnlyThreeRootsLeft.png, no_more_references.png
>
>
> Currently the unit tests are failing on circleci ([example 
> one|https://circleci.com/gh/jolynch/cassandra/300#tests/containers/1], 
> [example 
> two|https://circleci.com/gh/rustyrazorblade/cassandra/44#tests/containers/1]) 
> because we use a small container (medium) for unit tests by default and the 
> in JVM dtests are leaking a few hundred megabytes of memory per test right 
> now. This is not a big deal because the dtest runs with the larger containers 
> continue to function fine as well as local testing as the number of in JVM 
> dtests is not yet high enough to cause a problem with more than 2GB of 
> available heap. However we should fix the memory leak so that going forwards 
> we can add more in JVM dtests without worry.
> I've been working with [~ifesdjeen] to debug, and the issue appears to be 
> unreleased Table/Keyspace metrics (screenshot showing the leak attached). I 
> believe that we have a few potential issues that are leading to the leaks:
> 1. The 
> [{{Instance::shutdown}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/Instance.java#L328-L354]
>  method is not successfully cleaning up all the metrics created by the 
> {{CassandraMetricsRegistry}}
>  2. The 
> [{{TestCluster::close}}|https://github.com/apache/cassandra/blob/f22fec927de7ac291266660c2f34de5b8cc1c695/test/distributed/org/apache/cassandra/distributed/TestCluster.java#L283]
>  method is not waiting for all the instances to finish shutting down and 
> cleaning up before continuing on
> 3. I'm not sure if this is an issue assuming we clear all metrics, but 
> [{{TableMetrics::release}}|https://github.com/apache/cassandra/blob/4ae229f5cd270c2b43475b3f752a7b228de260ea/src/java/org/apache/cassandra/metrics/TableMetrics.java#L951]
>  does not release all the metric references (which could leak them)
> I am working on a patch which shuts down everything and assures that we do 
> not leak memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to