[ https://issues.apache.org/jira/browse/SPARK-29795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-29795.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26427
[https://github.com/apache/spark/pull/26427]

> Possible 'leak' of Metrics with dropwizard metrics 4.x
> ------------------------------------------------------
>
>                 Key: SPARK-29795
>                 URL: https://issues.apache.org/jira/browse/SPARK-29795
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Sean R. Owen
>            Assignee: Sean R. Owen
>            Priority: Minor
>             Fix For: 3.0.0
>
> This one's a little complex to explain.
> SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears
> to be fine, according to tests. We have not backported it to 2.4.x and do not
> intend to.
> However, I'm working with a few people trying to back-port this to Spark
> 2.4.x separately. When this update is applied, tests fail readily with
> OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A
> heap dump analysis shows that MetricRegistry objects are retaining a gigabyte
> or more of memory.
> They appear to be holding references to many large internal Spark objects,
> like BlockManager and Netty objects, via the closures we pass to Gauge
> objects. Although it looked odd, this may or may not be an issue; in normal
> usage, where a JVM hosts one SparkContext, it may be normal.
> However, in tests where contexts are started and stopped repeatedly, it seems
> like this might 'leak' references to old context-related objects across runs
> via metrics. I don't have a clear theory on how yet (is SparkEnv shared, or
> is some reference held to it?), beyond the empirical evidence. It's also not
> clear why this doesn't appear to affect Spark 3, whose tests work fine. It
> could be another fix in Spark 3 that happens to help here; it could be that
> Spark 3 uses less memory and never hits the issue.
> Despite that uncertainty, I've found that simply clearing the registered
> metrics from MetricsSystem when it is stop()-ped resolves the issue (sketched
> below). At that point, Spark is shutting down and sinks have stopped, so
> there doesn't seem to be any harm in manually releasing all registered
> metrics and objects. I don't _think_ metrics are intended to be tracked
> across two instantiations of a SparkContext in the same JVM, but that's a
> question.
> That's the change I will propose in a PR.
> Why does this not happen in 2.4 + metrics 3.x? It's unclear. We've not seen
> any test failures like this in 2.4, nor reports of problems with
> metrics-related memory pressure. It could be a change in how 4.x behaves,
> tracks objects, or manages lifecycles.
> The difference does not seem to be Scala 2.11 vs 2.12, by the way: 2.4 works
> fine on both without the 4.x update, and runs out of memory on both with the
> change.
> Why do this if it only affects 2.4 + metrics 4.x and we're not moving to
> metrics 4.x in 2.4? It could still be a smaller issue in Spark 3 that tests
> don't detect. It may help apps that, for various reasons, run multiple
> SparkContexts per JVM, like many other projects' test suites. And it may just
> be good tidiness at shutdown to manually clear resources.
> Therefore I can't call this a bug per se; maybe an improvement.
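>
> A minimal sketch of the idea, with illustrative names only (the actual change
> is in the PR linked above): a wrapper around a dropwizard MetricRegistry, the
> way MetricsSystem wraps one, that drops everything it registered when it is
> stopped. removeMatching(MetricFilter.ALL) is the dropwizard call that clears
> every registered metric.
> {code:scala}
> import com.codahale.metrics.{Gauge, MetricFilter, MetricRegistry}
>
> // Illustrative stand-in for Spark's MetricsSystem, not the real class.
> class SimpleMetricsSystem {
>   private val registry = new MetricRegistry
>
>   // The Gauge's closure captures `source`, so the registry keeps `source`
>   // (and anything it references) reachable for as long as the Gauge stays
>   // registered. This is how Gauges over BlockManager state can pin large
>   // object graphs.
>   def registerSizeGauge(name: String, source: scala.collection.Map[_, _]): Unit = {
>     registry.register(name, new Gauge[Int] {
>       override def getValue: Int = source.size
>     })
>   }
>
>   def stop(): Unit = {
>     // In the real system, sinks are stopped first. Then release every
>     // registered metric so stale Gauges can't retain context objects across
>     // SparkContext restarts in the same JVM.
>     registry.removeMatching(MetricFilter.ALL)
>   }
> }
> {code}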