[
https://issues.apache.org/jira/browse/SPARK-29795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen resolved SPARK-29795.
----------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 26427
[https://github.com/apache/spark/pull/26427]
> Possible 'leak' of Metrics with dropwizard metrics 4.x
> ------------------------------------------------------
>
> Key: SPARK-29795
> URL: https://issues.apache.org/jira/browse/SPARK-29795
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Sean R. Owen
> Assignee: Sean R. Owen
> Priority: Minor
> Fix For: 3.0.0
>
>
> This one's a little complex to explain.
> SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears
> to be fine, according to tests. We have not backported it to 2.4.x and do not
> intend to.
> However, I'm working with a few people trying to back-port this to Spark
> 2.4.x separately. When this update is applied, tests fail readily with
> OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A
> heap dump analysis shows that MetricRegistry objects are retaining a gigabyte
> or more of memory.
> The registry appears to be holding references to many large internal Spark
> objects, like BlockManager and Netty objects, via the closures we pass to
> Gauge objects. Although it looked odd, this may or may not be an issue; in
> normal usage, where a JVM hosts one SparkContext, this may be normal.
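> To illustrate the pattern (a rough, self-contained sketch, not Spark's actual
> code - the class and metric names here are made up), a Source-style gauge
> whose getValue closes over a live component keeps that component strongly
> reachable from the MetricRegistry for as long as the gauge stays registered:
> {code:scala}
> import com.codahale.metrics.{Gauge, MetricRegistry}
>
> // Hypothetical stand-in for a large internal component such as a BlockManager.
> class BigComponent {
>   private val state = new Array[Byte](64 * 1024 * 1024) // pretend this is block/Netty state
>   def memoryUsed: Long = state.length.toLong
> }
>
> object GaugeRetentionSketch {
>   def main(args: Array[String]): Unit = {
>     val registry  = new MetricRegistry
>     val component = new BigComponent
>
>     // The anonymous Gauge closes over `component`, so the registry transitively
>     // retains it for as long as the gauge stays registered.
>     registry.register(MetricRegistry.name("sketch", "memoryUsed"), new Gauge[Long] {
>       override def getValue: Long = component.memoryUsed
>     })
>
>     // Nothing ever removes the gauge here, so the 64 MB of state stays
>     // reachable from the registry even after the component is otherwise done with.
>     println(s"registered gauges: ${registry.getGauges.size()}")
>   }
> }
> {code}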
> However, in tests where contexts are started and restarted repeatedly, it
> seems like this might 'leak' references to old context-related objects across
> runs via metrics. I don't have a clear theory on how yet (is SparkEnv shared,
> or is some reference to it held?), beyond the empirical evidence. It's also
> not clear why this apparently doesn't affect Spark 3, as its tests work fine.
> It could be another fix in Spark 3 that happens to help here; it could be
> that Spark 3 uses less memory and never hits the issue.
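> Purely as an illustration of how such cross-run retention could look (this is
> a hypothetical sketch, not the confirmed mechanism in Spark - the shared
> registry here stands in for whatever unexpectedly outlives a context), a
> registry that survives repeated "start/stop" cycles accumulates gauges, and
> with them the per-run state they captured:
> {code:scala}
> import com.codahale.metrics.{Gauge, MetricRegistry}
>
> object CrossRunRetentionSketch {
>   // Stands in for whatever keeps a registry reachable across context restarts;
>   // the real mechanism, if any, is exactly what's unclear above.
>   val sharedRegistry = new MetricRegistry
>
>   // Hypothetical per-"context" state, standing in for BlockManager/Netty objects.
>   final class FakeContextState { val state = new Array[Byte](32 * 1024 * 1024) }
>
>   def startAndStopContext(runId: Int): Unit = {
>     val ctx = new FakeContextState
>     // "Starting" the context registers a gauge that captures its state...
>     sharedRegistry.register(MetricRegistry.name("run", runId.toString, "stateBytes"),
>       new Gauge[Long] { override def getValue: Long = ctx.state.length.toLong })
>     // ..."stopping" it does nothing to the registry, so that state stays reachable.
>   }
>
>   def main(args: Array[String]): Unit = {
>     (1 to 5).foreach(startAndStopContext)
>     // Five runs' worth of captured state is still reachable through the registry.
>     println(s"gauges retained across runs: ${sharedRegistry.getGauges.size()}")
>   }
> }
> {code}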
> Despite that uncertainty, I've found that simply clearing the registered
> metrics from MetricsSystem when it is stop()-ped seems to resolve the issue.
> At that point, Spark is shutting down and the sinks have stopped, so there
> doesn't seem to be any harm in manually releasing all registered metrics and
> the objects they reference. I don't _think_ Spark is intended to track metrics
> across two instantiations of a SparkContext in the same JVM, but that's an
> open question. That's the change I will propose in a PR.
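> The idea is roughly the following (a sketch only, assuming a simplified
> MetricsSystem; see the PR for the actual change): once the sinks are stopped,
> nothing will read these metrics again, so the registry can drop everything it
> holds, e.g. via dropwizard's removeMatching with MetricFilter.ALL.
> {code:scala}
> import com.codahale.metrics.{MetricFilter, MetricRegistry}
> import scala.collection.mutable.ArrayBuffer
>
> // Minimal stand-ins for the sketch; not Spark's real classes.
> trait Sink { def start(): Unit; def stop(): Unit }
>
> class SketchMetricsSystem {
>   private val registry = new MetricRegistry
>   private val sinks = ArrayBuffer.empty[Sink]
>
>   def stop(): Unit = {
>     sinks.foreach(_.stop())
>     // After the sinks have stopped, nothing reads these metrics again, so
>     // release them all; gauges then no longer pin context-era objects.
>     registry.removeMatching(MetricFilter.ALL)
>   }
> }
> {code}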
> Why does this not happen in 2.4 + metrics 3.x? Unclear. We've not seen any
> test failures like this in 2.4, or reports of problems with metrics-related
> memory pressure. It could be a change in how 4.x behaves, tracks objects, or
> manages lifecycles.
> The difference does not seem to be Scala 2.11 vs 2.12, by the way: 2.4 works
> fine on both without the 4.x update, and runs out of memory on both with the
> change.
> Why do this if it only affects 2.4 + metrics 4.x, and we're not moving to
> metrics 4.x in 2.4? It could still be a smaller issue in Spark 3 that the
> tests don't detect. It may help apps that, for various reasons, run multiple
> SparkContexts per JVM - like many other projects' test suites. And it may
> simply be good hygiene to manually clear these resources at shutdown.
> Therefore I can't call this a bug per se; it's more of an improvement.