Hi, we have been experiencing issues in production over the past couple of weeks with Spark Standalone Worker JVMs that appear to have memory leaks. Old Gen usage accumulates until it reaches the max, at which point the Worker enters a failed state and starts critically failing some of the applications running against the cluster.
I've done some exploration of the Spark code base related to the Worker in search of potential sources of this problem, and I'm looking for some commentary on a couple of theories I have.

Observation 1: The `finishedExecutors` HashMap seems to only accumulate new entries over time, unbounded. It only ever appears to be appended to and is never periodically purged of older executors, in line with something like the worker cleanup scheduler.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473

I feel somewhat confident that over time this will exhibit a "leak". I put that in quotes only because holding these references may be intentional to support functionality, as opposed to a true leak where memory is simply held onto by accident.

Observation 2: I feel much less certain about this, but it looks like when the Worker receives a `KillExecutor` message it only kills the `ExecutorRunner` and does not remove it from the executors map.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L492

I haven't been able to work out whether I'm missing something indirect that removes that executor from the map before or after the kill. However, if nothing does, then it may be leaking references in this map. (I've put rough sketches of both observations at the end of this mail.)

One final observation, related to our production metrics rather than the codebase itself: we used to periodically see completed applications whose executors all had the status "Killed" instead of "Exited". Now, however, every completed application shows a final state of "Killed" for all of its executors. I might speculatively correlate this with Observation 2 as a potential reason we have started seeing this issue more recently. We also have a larger and increasing workload over the past few weeks, and possibly code changes to the application description, either of which could be exacerbating these potential underlying issues. We run a lot of smaller applications per day, somewhere in the range of hundreds to maybe 1000 applications per day, with 16 executors per application.

Thanks,

Richard Marscher
Software Engineer
Localytics
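P.S. To make the observations a bit more concrete, here are two rough sketches of my reading of the Worker code. These are not the actual Spark classes, just simplified stand-ins (names like ExecutorInfo, ExecutorRunnerStub, and maxRetainedExecutors are made up for illustration), so please correct me where my mental model is off.

Sketch for Observation 1: entries move from `executors` into `finishedExecutors` on completion, and what I can't find in the real code is anything like the retention cap at the end:

import scala.collection.mutable.{HashMap, LinkedHashMap}

// Not actual Spark code: a simplified model of the Worker's executor tracking.
object FinishedExecutorSketch {

  // Stand-in for ExecutorRunner; just enough state for the illustration.
  case class ExecutorInfo(appId: String, execId: Int, state: String)

  // Live executors, keyed by "appId/execId" as in the Worker.
  val executors = new HashMap[String, ExecutorInfo]()

  // Finished executors: as I read Worker.scala, this map is only ever appended to.
  val finishedExecutors = new LinkedHashMap[String, ExecutorInfo]()

  // Hypothetical retention cap; without something like this the map grows unbounded.
  val maxRetainedExecutors = 1000

  def onExecutorFinished(fullId: String): Unit = {
    executors.remove(fullId).foreach { info =>
      finishedExecutors(fullId) = info.copy(state = "FINISHED")
      // This is the part I can't find in the current code: dropping the
      // oldest entries once the map passes some retention limit.
      if (finishedExecutors.size > maxRetainedExecutors) {
        val excess = finishedExecutors.size - maxRetainedExecutors
        finishedExecutors.keys.take(excess).toList.foreach(finishedExecutors.remove)
      }
    }
  }
}

Sketch for Observation 2: the `KillExecutor` path as I read it, with the open question being whether anything downstream removes the entry from `executors`:

import scala.collection.mutable.HashMap

// Not actual Spark code: a simplified model of the KillExecutor handling.
object KillExecutorSketch {

  case class KillExecutor(masterUrl: String, appId: String, execId: Int)

  // Stand-in for ExecutorRunner.
  class ExecutorRunnerStub(val fullId: String) {
    def kill(): Unit = {
      // The real ExecutorRunner kills the executor process. The open question
      // is whether that eventually comes back to the Worker as a state change
      // that removes this runner from `executors`, or whether the reference
      // lingers in the map.
      println(s"Killing executor $fullId")
    }
  }

  val executors = new HashMap[String, ExecutorRunnerStub]()

  def receive(msg: KillExecutor): Unit = {
    val fullId = msg.appId + "/" + msg.execId
    executors.get(fullId) match {
      case Some(runner) =>
        runner.kill()
        // As far as I can tell there is no executors.remove(fullId) here;
        // if nothing downstream does it either, the runner stays referenced.
      case None =>
        println(s"Asked to kill unknown executor $fullId")
    }
  }
}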