Josh Rosen created SPARK-9202:
---------------------------------
Summary: Worker should not have data structures which grow without
bound
Key: SPARK-9202
URL: https://issues.apache.org/jira/browse/SPARK-9202
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.4.1, 1.5.0
Reporter: Josh Rosen
Originally reported by Richard Marscher on the dev list:
{quote}
we have been experiencing issues in production over the past couple weeks with
Spark Standalone Worker JVMs seeming to have memory leaks. They accumulate Old
Gen until it reaches max and then reach a failed state that starts critically
failing some applications running against the cluster.
I've done some exploration of the Spark code base related to Worker in search
of potential sources of this problem and am looking for some commentary on a
couple theories I have:
Observation 1: The `finishedExecutors` HashMap seem to only accumulate new
entries over time unbounded. It only seems to be appended and never
periodically purged or cleaned of older executors in line with something like
the worker cleanup scheduler.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473
{quote}
It looks like the finishedExecutors map is only read when rendering the Worker
Web UI and in constructing REST API responses. I think that we could address
this leak by adding a configuration to cap the maximum number of retained
executors, applications, etc. We already have similar caps in the driver UI.
If we add this configuration, I think that we should pick some sensible default
value rather than an unlimited one. This is technically a user-facing behavior
change but I think it's okay since the current behavior is to crash / OOM.
To fix this, we should:
- Add the new configurations to cap how much data we retain.
- Add a stress-tester to spark-perf so that we have a way to reproduce these
leaks during QA.
- Add some unit tests to ensure that cleanup is performed at the right places.
This test should be modeled after the memory-leak-prevention strategies that we
employed in JobProgressListener and in other parts of the Driver UI.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]