[ https://issues.apache.org/jira/browse/SPARK-29795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-29795.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26427
[https://github.com/apache/spark/pull/26427]

> Possible 'leak' of Metrics with dropwizard metrics 4.x
> ------------------------------------------------------
>
>                 Key: SPARK-29795
>                 URL: https://issues.apache.org/jira/browse/SPARK-29795
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Sean R. Owen
>            Assignee: Sean R. Owen
>            Priority: Minor
>             Fix For: 3.0.0
>
> This one's a little complex to explain.
> SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears
> to be fine, according to tests. We have not backported it to 2.4.x and do not
> intend to.
> However, I'm working with a few people trying to back-port this to Spark
> 2.4.x separately. When this update is applied, tests fail readily with
> OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A
> heap dump analysis shows that MetricRegistry objects are retaining a gigabyte
> or more of memory.
> They appear to be holding references to many large internal Spark objects,
> like BlockManager and Netty objects, via the closures we pass to Gauge
> objects. Although it looked odd, this may or may not be an issue; in normal
> usage, where a JVM hosts one SparkContext, it may be normal.
> However, in tests where contexts are started and stopped repeatedly, it seems
> like this might 'leak' references to old context-related objects across runs
> via metrics. I don't have a clear theory on how yet (is SparkEnv shared, or
> is some reference held to it?), beyond the empirical evidence. It's also not
> clear why this doesn't appear to affect Spark 3, whose tests work fine. It
> could be another fix in Spark 3 that happens to help here; it could be that
> Spark 3 uses less memory and never hits the issue.
> Despite that uncertainty, I've found that simply clearing the registered
> metrics from MetricsSystem when it is stop()-ped resolves the issue (sketched
> below). At that point, Spark is shutting down and sinks have stopped, so
> there doesn't seem to be any harm in manually releasing all registered
> metrics and objects. I don't _think_ metrics are intended to be tracked
> across two instantiations of a SparkContext in the same JVM, but that's a
> question.
> That's the change I will propose in a PR.
> Why does this not happen in 2.4 + metrics 3.x? It's unclear. We've not seen
> any test failures like this in 2.4, nor reports of problems with
> metrics-related memory pressure. It could be a change in how 4.x behaves,
> tracks objects, or manages lifecycles.
> The difference does not seem to be Scala 2.11 vs 2.12, by the way: 2.4 works
> fine on both without the 4.x update, and runs out of memory on both with the
> change.
> Why do this if it only affects 2.4 + metrics 4.x and we're not moving to
> metrics 4.x in 2.4? It could still be a smaller issue in Spark 3 that tests
> don't detect. It may help apps that, for various reasons, run multiple
> SparkContexts per JVM, like many other projects' test suites. And it may just
> be good tidiness at shutdown to manually clear resources.
> Therefore I can't call this a bug per se; maybe an improvement.
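>
> A minimal sketch of the idea, with illustrative names only (the actual change
> is in the PR linked above): a wrapper around a dropwizard MetricRegistry, the
> way MetricsSystem wraps one, that drops everything it registered when it is
> stopped. removeMatching(MetricFilter.ALL) is the dropwizard call that clears
> every registered metric.
> {code:scala}
> import com.codahale.metrics.{Gauge, MetricFilter, MetricRegistry}
>
> // Illustrative stand-in for Spark's MetricsSystem, not the real class.
> class SimpleMetricsSystem {
>   private val registry = new MetricRegistry
>
>   // The Gauge's closure captures `source`, so the registry keeps `source`
>   // (and anything it references) reachable for as long as the Gauge stays
>   // registered. This is how Gauges over BlockManager state can pin large
>   // object graphs.
>   def registerSizeGauge(name: String, source: scala.collection.Map[_, _]): Unit = {
>     registry.register(name, new Gauge[Int] {
>       override def getValue: Int = source.size
>     })
>   }
>
>   def stop(): Unit = {
>     // In the real system, sinks are stopped first. Then release every
>     // registered metric so stale Gauges can't retain context objects across
>     // SparkContext restarts in the same JVM.
>     registry.removeMatching(MetricFilter.ALL)
>   }
> }
> {code}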