[ https://issues.apache.org/jira/browse/SPARK-43523 ]
Amine Bagdouri deleted comment on SPARK-43523:
----------------------------------------
was (Author: JIRAUSER300423):
[^spark_shell_oom.log]
> Memory leak in Spark UI
> -----------------------
>
> Key: SPARK-43523
> URL: https://issues.apache.org/jira/browse/SPARK-43523
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.4.4, 3.4.0
> Reporter: Amine Bagdouri
> Priority: Major
> Attachments: spark_shell_oom.log, spark_ui_memory_leak.zip
>
>
> We have a distributed Spark application running on Azure HDInsight using
> Spark version 2.4.4.
> After a few days of active processing, we noticed that the driver's GC CPU
> time ratio was close to 100%. Suspecting a memory leak, we produced a heap
> dump and analyzed it with Eclipse Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
> * The estimated retained heap size of String objects (~5M instances) is 3.3
> GB. It seems that most of these instances correspond to Spark events.
> * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
> * The number of LiveJob objects with status "RUNNING" is 18K, although there
> should never be more than 16 running jobs, since we use a fixed-size thread
> pool of 16 threads to run Spark queries.
> * The number of LiveTask objects is 485K.
> * The AsyncEventQueue instance associated with the AppStatusListener has a
> dropped events count of 854 and a total events count of 10001. Note that the
> dropped events counter is reset every minute and that the queue's default
> capacity is 10000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the
> root cause of this leak:
> * AppStatusListener is notified of Spark events using a bounded queue in
> AsyncEventQueue.
> * AppStatusListener updates its state (kvstore, liveTasks, liveStages,
> liveJobs, ...) based on the events it receives. For example, onTaskStart adds
> a task to the liveTasks map and onTaskEnd removes it.
> * When the event rate is very high, the bounded queue in AsyncEventQueue
> fills up, and some events are dropped and never reach AppStatusListener.
> * Dropped events that signal the end of a processing unit prevent the state
> of AppStatusListener from being cleaned up. For example, a dropped onTaskEnd
> event prevents the task from being removed from the liveTasks map, so the
> task remains on the heap until the driver's JVM is stopped.
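
The mechanism described in the list above can be illustrated with a small
toy model. This is only a sketch in plain Python, not Spark code: the names
BoundedEventQueue, ToyStatusListener and simulate are hypothetical stand-ins
for AsyncEventQueue and AppStatusListener, and the event pattern (start events
arriving slowly, end events arriving in one burst) is an assumption chosen to
make the drops deterministic.

```python
from collections import deque


class BoundedEventQueue:
    """Toy stand-in for a bounded listener queue: drops events when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def post(self, event):
        # A full queue silently discards the event and counts the drop.
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def drain(self, listener):
        # Deliver every queued event to the listener, in order.
        while self.events:
            listener.handle(self.events.popleft())


class ToyStatusListener:
    """Toy stand-in for a listener that tracks live tasks in a map."""

    def __init__(self):
        self.live_tasks = {}

    def handle(self, event):
        kind, task_id = event
        if kind == "start":
            self.live_tasks[task_id] = "RUNNING"
        elif kind == "end":
            self.live_tasks.pop(task_id, None)


def simulate(num_tasks, capacity):
    """Return (tasks still tracked as live, number of dropped events)."""
    queue = BoundedEventQueue(capacity)
    listener = ToyStatusListener()

    # Start events arrive slowly: the listener keeps up, nothing is dropped.
    for task_id in range(num_tasks):
        queue.post(("start", task_id))
        queue.drain(listener)

    # End events arrive in one burst before the listener can drain the queue.
    for task_id in range(num_tasks):
        queue.post(("end", task_id))
    queue.drain(listener)

    return len(listener.live_tasks), queue.dropped


if __name__ == "__main__":
    print(simulate(num_tasks=100, capacity=10))
    print(simulate(num_tasks=100, capacity=1000))
```

In this model, every task whose end event is dropped stays in live_tasks
forever, because no later event ever removes it. With a capacity large enough
to absorb the burst, nothing is dropped and the map empties.
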
> We confirmed our analysis by reducing the capacity of the AsyncEventQueue
> (spark.scheduler.listenerbus.eventqueue.capacity=10). After launching many
> Spark queries with this config, we observed that the number of active jobs
> in Spark UI increased rapidly and remained high even though all submitted
> queries had completed. We also noticed that some executor task counters in
> Spark UI were negative, which confirms that AppStatusListener's state does
> not accurately reflect reality and that it is affected by event drops.
> Suggested fix:
> There are limits today on the number of "dead" objects in
> AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest
> enforcing another configurable limit on the total number of objects in
> AppStatusListener's maps and kvstore. This should bound the leak when the
> event rate is high, but AppStatusListener stats will remain inaccurate.
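
One possible shape for such a limit is sketched below in plain Python. It is
an illustration only, not a patch against Spark: CappedLiveMap and its
max_entries parameter are hypothetical names, and evicting the oldest entry
first is an assumed policy that the issue does not specify.

```python
from collections import OrderedDict


class CappedLiveMap:
    """Map of live objects with a hard cap on the total number of entries.

    When the cap is exceeded, the oldest entries are evicted, whether or not
    an end event was ever received for them.
    """

    def __init__(self, max_entries):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self.max_entries = max_entries  # hypothetical limit, not a Spark setting
        self._entries = OrderedDict()
        self.evicted = 0

    def put(self, key, value):
        # Re-inserting an existing key refreshes its position.
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        # Enforce the cap by evicting from the oldest end.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
            self.evicted += 1

    def remove(self, key):
        # Normal cleanup path, used when the end event does arrive.
        return self._entries.pop(key, None)

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)


if __name__ == "__main__":
    live_tasks = CappedLiveMap(max_entries=1000)
    # Worst case: 5000 tasks start and no end event is ever delivered.
    for task_id in range(5000):
        live_tasks.put(task_id, "RUNNING")
    print(len(live_tasks), live_tasks.evicted)
```

The cap keeps memory bounded even if every end event is lost, at the cost of
possibly evicting an entry for a task that is still running, which matches
the caveat above that the stats would remain inaccurate.
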