[
https://issues.apache.org/jira/browse/SPARK-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-9202.
------------------------------
Resolution: Fixed
Fix Version/s: 1.5.0
Issue resolved by pull request 7714
[https://github.com/apache/spark/pull/7714]
> Worker should not have data structures which grow without bound
> ---------------------------------------------------------------
>
> Key: SPARK-9202
> URL: https://issues.apache.org/jira/browse/SPARK-9202
> Project: Spark
> Issue Type: Bug
> Components: Deploy
> Affects Versions: 1.4.1, 1.5.0
> Reporter: Josh Rosen
> Assignee: Nan Zhu
> Priority: Blocker
> Fix For: 1.5.0
>
>
> Originally reported by Richard Marscher on the dev list:
> {quote}
> we have been experiencing issues in production over the past couple of weeks
> with Spark Standalone Worker JVMs that appear to have memory leaks. They
> accumulate Old Gen heap until it reaches the maximum, then enter a failed state
> that critically fails some of the applications running against the cluster.
> I've done some exploration of the Spark code base related to the Worker in
> search of potential sources of this problem, and am looking for commentary on
> a couple of theories I have:
> Observation 1: The `finishedExecutors` HashMap seems to only accumulate new
> entries over time, unbounded. Entries only ever seem to be appended, and the
> map is never periodically purged of older executors by something like the
> worker cleanup scheduler.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473
> {quote}
> It looks like the finishedExecutors map is only read when rendering the
> Worker web UI and when constructing REST API responses. I think that we could
> address this leak by adding a configuration to cap the maximum number of
> retained executors, applications, etc. We already have similar caps in the
> driver UI. If we add this configuration, I think that we should pick some
> sensible default value rather than an unlimited one. This is technically a
> user-facing behavior change but I think it's okay since the current behavior
> is to crash / OOM.
> To fix this, we should:
> - Add the new configurations to cap how much data we retain.
> - Add a stress-tester to spark-perf so that we have a way to reproduce these
> leaks during QA.
> - Add some unit tests to ensure that cleanup is performed at the right
> places. This test should be modeled after the memory-leak-prevention
> strategies that we employed in JobProgressListener and in other parts of the
> Driver UI.
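> The cap described above could be sketched roughly as follows. This is a
> minimal illustration, not the actual patch from PR 7714; the names
> `retainedExecutors`, `addFinishedExecutor`, and `trimIfNecessary`, and the
> default value of 1000, are all hypothetical:

```scala
// Sketch of bounding the finished-executor map with a retention cap.
// LinkedHashMap preserves insertion order, so the oldest entries can be
// dropped first once the configured limit is exceeded.
import scala.collection.mutable.LinkedHashMap

object FinishedExecutorRetention {
  // Assumed default; the real value would come from a new configuration key.
  val retainedExecutors = 1000

  // Maps executor id -> last known state (simplified to String here).
  val finishedExecutors = LinkedHashMap[String, String]()

  def addFinishedExecutor(id: String, state: String): Unit = {
    finishedExecutors(id) = state
    trimIfNecessary()
  }

  // Drop the oldest entries so the map can never grow past the cap.
  private def trimIfNecessary(): Unit = {
    if (finishedExecutors.size > retainedExecutors) {
      val toRemove = finishedExecutors.size - retainedExecutors
      finishedExecutors.keys.take(toRemove).toList.foreach(finishedExecutors.remove)
    }
  }
}
```

> Trimming on every insert keeps the map bounded at the cap, mirroring the
> retention strategy already used by JobProgressListener in the driver UI.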
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]