[ https://issues.apache.org/jira/browse/SPARK-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-9202.
------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7714
[https://github.com/apache/spark/pull/7714]

> Worker should not have data structures which grow without bound
> ---------------------------------------------------------------
>
>                 Key: SPARK-9202
>                 URL: https://issues.apache.org/jira/browse/SPARK-9202
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.4.1, 1.5.0
>            Reporter: Josh Rosen
>            Assignee: Nan Zhu
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> Originally reported by Richard Marscher on the dev list:
> {quote}
> we have been experiencing issues in production over the past couple of weeks 
> with Spark Standalone Worker JVMs that appear to have memory leaks. They 
> accumulate Old Gen heap until it reaches its maximum and then enter a failed 
> state that starts critically failing some applications running against the 
> cluster.
> I've done some exploration of the Spark code base related to the Worker in 
> search of potential sources of this problem and am looking for commentary on 
> a couple of theories I have:
> Observation 1: The `finishedExecutors` HashMap seems to only accumulate new 
> entries over time, without bound. It is only ever appended to and is never 
> periodically purged or cleaned of older executors by something like the 
> worker cleanup scheduler. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L473
> {quote}
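> The pattern described in the quote, reduced to a minimal sketch (only the 
> `finishedExecutors` name comes from the Worker; the class and method names 
> here are illustrative, not the actual Worker code):
> {code:scala}
> import scala.collection.mutable
>
> // Minimal sketch of the reported leak: the map is written to whenever an
> // executor finishes and nothing ever removes old entries, so it grows for
> // the lifetime of the Worker JVM.
> class LeakyWorkerSketch {
>   private val finishedExecutors = new mutable.HashMap[String, String]
>
>   def onExecutorFinished(fullId: String, state: String): Unit = {
>     // Entry added on every executor exit, never purged afterwards.
>     finishedExecutors(fullId) = state
>   }
> }
> {code}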
> It looks like the finishedExecutors map is only read when rendering the 
> Worker Web UI and in constructing REST API responses.  I think that we could 
> address this leak by adding a configuration to cap the maximum number of 
> retained executors, applications, etc.  We already have similar caps in the 
> driver UI.  If we add this configuration, I think that we should pick some 
> sensible default value rather than an unlimited one.  This is technically a 
> user-facing behavior change, but I think it's okay since the current behavior 
> is to crash / OOM.
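> A minimal sketch of that capping approach (the class, configuration name, 
> default value, and trim strategy below are illustrative assumptions, not the 
> actual change):
> {code:scala}
> import scala.collection.mutable
>
> // Sketch of the capping idea: keep finishedExecutors bounded by trimming the
> // oldest entries once a configurable cap is exceeded, similar in spirit to
> // the retention caps in the driver UI. The class, config name, and default
> // value are illustrative assumptions, not the actual Worker code.
> class BoundedWorkerSketch(conf: Map[String, String]) {
>   private val retainedExecutors =
>     conf.getOrElse("spark.worker.ui.retainedExecutors", "1000").toInt
>
>   // Insertion order is preserved, so the oldest entries are trimmed first.
>   private val finishedExecutors = new mutable.LinkedHashMap[String, String]
>
>   def finishedExecutorCount: Int = finishedExecutors.size
>
>   def onExecutorFinished(fullId: String, state: String): Unit = {
>     finishedExecutors(fullId) = state
>     if (finishedExecutors.size > retainedExecutors) {
>       // Drop a batch of the oldest entries rather than one at a time.
>       val toRemove = math.max(retainedExecutors / 10, 1)
>       finishedExecutors.keys.take(toRemove).toList.foreach(finishedExecutors.remove)
>     }
>   }
> }
> {code}
> Trimming a batch of the oldest entries, rather than one entry per event, 
> amortizes the trimming cost while still keeping memory bounded.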
> To fix this, we should:
> - Add the new configurations to cap how much data we retain.
> - Add a stress-tester to spark-perf so that we have a way to reproduce these 
> leaks during QA.
> - Add some unit tests to ensure that cleanup is performed at the right 
> places (a sketch follows this list). These tests should be modeled after the 
> memory-leak-prevention strategies that we employed in JobProgressListener and 
> in other parts of the Driver UI.
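> For example, such a test could look roughly like this (written against the 
> illustrative BoundedWorkerSketch above rather than the real Worker; ScalaTest 
> assumed):
> {code:scala}
> import org.scalatest.FunSuite
>
> // Sketch of a cleanup test, modeled on the "does this collection stay
> // bounded?" style of checks used for the driver-side listeners. It exercises
> // the illustrative BoundedWorkerSketch above, not the real Worker class.
> class WorkerRetentionSuite extends FunSuite {
>   test("finishedExecutors never exceeds the configured retention cap") {
>     val cap = 50
>     val worker = new BoundedWorkerSketch(
>       Map("spark.worker.ui.retainedExecutors" -> cap.toString))
>     (1 to 10 * cap).foreach { i =>
>       worker.onExecutorFinished(s"app-$i/0", "EXITED")
>     }
>     assert(worker.finishedExecutorCount <= cap)
>   }
> }
> {code}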


