Josh Rosen created SPARK-9202:
---------------------------------

             Summary: Worker should not have data structures which grow without 
bound
                 Key: SPARK-9202
                 URL: https://issues.apache.org/jira/browse/SPARK-9202
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.4.1, 1.5.0
            Reporter: Josh Rosen


Originally reported by Richard Marscher on the dev list:

{quote}
we have been experiencing issues in production over the past couple of weeks with 
Spark Standalone Worker JVMs that appear to have memory leaks. Old Gen usage 
accumulates until it reaches its maximum, at which point the Worker enters a 
failed state that critically impacts some applications running against the 
cluster.

I've done some exploration of the Spark code base related to Worker in search 
of potential sources of this problem and am looking for some commentary on a 
couple theories I have:

Observation 1: The `finishedExecutors` HashMap seems to accumulate new entries 
over time without bound. It appears only ever to be appended to, and is never 
periodically purged of older executors by something like the worker cleanup 
scheduler. 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473
{quote}

It looks like the finishedExecutors map is only read when rendering the Worker 
Web UI and when constructing REST API responses. I think we could address this 
leak by adding configurations to cap the maximum number of retained executors, 
applications, etc.; we already have similar caps in the driver UI. If we add 
these configurations, we should pick sensible default values rather than 
unlimited ones. This is technically a user-facing behavior change, but I think 
it's acceptable since the current behavior is to crash / OOM.
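As a rough illustration of the capping approach, here is a minimal Scala sketch. The configuration name and trim amounts below are hypothetical, not the actual keys or values this issue will introduce; the real fix would live in Worker.scala.

```scala
import scala.collection.mutable

// Hypothetical cap on retained finished executors; the real config key
// and default value are not decided in this issue.
val retainedExecutors = 1000

// Keep insertion order so the oldest entries can be evicted first.
// Values here stand in for the real ExecutorRunner objects.
val finishedExecutors = mutable.LinkedHashMap[String, String]()

def trimFinishedExecutorsIfNecessary(): Unit = {
  if (finishedExecutors.size > retainedExecutors) {
    // Evict a batch of the oldest entries (10% of the cap) so the
    // trim does not run on every single insertion.
    val toRemove = math.max(retainedExecutors / 10, 1)
    finishedExecutors.take(toRemove).keys.foreach(finishedExecutors.remove)
  }
}

// Simulate many executors finishing: the map stays bounded near the cap.
(1 to 1005).foreach { i =>
  finishedExecutors(s"app-$i/0") = "EXITED"
  trimFinishedExecutorsIfNecessary()
}
```

Trimming in batches, rather than evicting exactly one entry per insert, mirrors the strategy used by the driver-side listeners.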

To fix this, we should:

- Add the new configurations to cap how much data we retain.
- Add a stress-tester to spark-perf so that we have a way to reproduce these 
leaks during QA.
- Add some unit tests to ensure that cleanup is performed in the right places. 
These tests should be modeled after the memory-leak-prevention strategies that 
we employed in JobProgressListener and in other parts of the driver UI.
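A unit test along those lines might look like the following Scala sketch. The cap value, helper name, and executor IDs are all made up for illustration; the real test would exercise the Worker itself.

```scala
import scala.collection.mutable

// Hypothetical value standing in for the new retention config.
val cap = 50
val finished = mutable.LinkedHashMap[String, Int]()

// Stand-in for the Worker's bookkeeping when an executor finishes:
// record it, then evict the oldest entry if the cap is exceeded.
def recordFinished(id: String, exitCode: Int): Unit = {
  finished(id) = exitCode
  if (finished.size > cap) {
    finished.remove(finished.head._1)
  }
}

// Drive many executors to completion, JobProgressListener-test style,
// and assert that retention stays bounded and evicts oldest-first.
(1 to 500).foreach(i => recordFinished(s"app-$i/0", 0))
assert(finished.size == cap, s"expected $cap retained, got ${finished.size}")
assert(!finished.contains("app-1/0"))   // oldest entries were evicted
assert(finished.contains("app-500/0"))  // newest entries were kept
```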



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
