[ https://issues.apache.org/jira/browse/SPARK-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-9202.
------------------------------
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7714
[https://github.com/apache/spark/pull/7714]

> Worker should not have data structures which grow without bound
> ----------------------------------------------------------------
>
>                 Key: SPARK-9202
>                 URL: https://issues.apache.org/jira/browse/SPARK-9202
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Josh Rosen
>            Assignee: Nan Zhu
>            Priority: Blocker
>             Fix For: 1.5.0
>
> Originally reported by Richard Marscher on the dev list:
> {quote}
> We have been experiencing issues in production over the past couple of weeks with Spark Standalone Worker JVMs that appear to have memory leaks. They accumulate Old Gen until it reaches the maximum, and then they enter a failed state that critically fails some applications running against the cluster.
> I've done some exploration of the Spark code base related to the Worker in search of potential sources of this problem, and I'm looking for commentary on a couple of theories I have.
> Observation 1: The `finishedExecutors` HashMap seems only to accumulate new entries over time, unbounded. It is only ever appended to and is never periodically purged of older executors by anything like the worker cleanup scheduler.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473
> {quote}
> It looks like the finishedExecutors map is only read when rendering the Worker Web UI and when constructing REST API responses. I think we could address this leak by adding a configuration to cap the maximum number of retained executors, applications, etc.; we already have similar caps in the driver UI (a sketch of one such cap follows below). If we add this configuration, we should pick a sensible default value rather than an unlimited one. This is technically a user-facing behavior change, but I think it's okay since the current behavior is to crash / OOM.
> To fix this, we should:
> - Add the new configurations to cap how much data we retain.
> - Add a stress-tester to spark-perf so that we have a way to reproduce these leaks during QA.
> - Add unit tests to ensure that cleanup is performed in the right places, modeled after the memory-leak-prevention strategies we employed in JobProgressListener and in other parts of the driver UI.
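Below is a minimal sketch of the capping approach discussed above: an insertion-ordered map that evicts its oldest entries once a configured limit is exceeded, in the spirit of the retained-stages/jobs caps in the driver UI. This is an illustration only, not the actual change merged in pull request 7714; the class name, the configuration key, and the default value are assumptions.

{code:scala}
import scala.collection.mutable

// Illustrative sketch; not the actual patch from pull request 7714.
class BoundedLinkedMap[K, V](maxRetained: Int) {
  // LinkedHashMap preserves insertion order, so the oldest finished
  // executors are the first keys returned by the iterator.
  private val underlying = mutable.LinkedHashMap.empty[K, V]

  def put(key: K, value: V): Unit = {
    underlying(key) = value
    trimIfNecessary()
  }

  private def trimIfNecessary(): Unit = {
    if (underlying.size > maxRetained) {
      // Materialize the excess keys first so we don't mutate while iterating.
      val excess = underlying.keys.take(underlying.size - maxRetained).toList
      excess.foreach(underlying.remove)
    }
  }

  def get(key: K): Option[V] = underlying.get(key)
  def values: Iterable[V] = underlying.values
  def size: Int = underlying.size
}

// Hypothetical usage in the Worker: retain a bounded amount of
// finished-executor metadata instead of growing without limit.
object WorkerRetentionExample {
  def main(args: Array[String]): Unit = {
    // "spark.worker.ui.retainedExecutors" and the default of 1000 are
    // illustrative assumptions, not confirmed by this issue.
    val maxRetained = sys.props.get("spark.worker.ui.retainedExecutors")
      .map(_.toInt).getOrElse(1000)
    val finishedExecutors = new BoundedLinkedMap[String, String](maxRetained)
    (1 to 1500).foreach(i => finishedExecutors.put(s"app-20150721/$i", "EXITED"))
    println(finishedExecutors.size) // 1000: the oldest 500 entries were evicted
  }
}
{code}

The same trimming would apply to the finished applications and drivers mentioned in the to-do list, with each collection getting its own retention limit.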