[jira] (SPARK-43523) Memory leak in Spark UI

Amine Bagdouri (Jira) Sat, 20 May 2023 16:21:06 -0700


    [ https://issues.apache.org/jira/browse/SPARK-43523 ]



    Amine Bagdouri deleted comment on SPARK-43523:
    ----------------------------------------

was (Author: JIRAUSER300423):
[^spark_shell_oom.log]

> Memory leak in Spark UI
> -----------------------
>
>                 Key: SPARK-43523
>                 URL: https://issues.apache.org/jira/browse/SPARK-43523
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.4.4, 3.4.0
>            Reporter: Amine Bagdouri
>            Priority: Major
>         Attachments: spark_shell_oom.log, spark_ui_memory_leak.zip
>
>
> We have a distributed Spark application running on Azure HDInsight using 
> Spark version 2.4.4.
> After a few days of active processing on our application, we have noticed 
> that the GC CPU time ratio of the driver is close to 100%. We suspected a 
> memory leak. Thus, we have produced a heap dump and analyzed it using Eclipse 
> Memory Analyzer.
> Here is some interesting data from the driver's heap dump (heap size is 8 GB):
>  * The estimated retained heap size of String objects (~5M instances) is 3.3 
> GB. It seems that most of these instances correspond to spark events.
>  * Spark UI's AppStatusListener instance estimated retained size is 1.1 GB.
>  * The number of LiveJob objects with status "RUNNING" is 18K, knowing that 
> there shouldn't be more than 16 live running jobs since we use a fixed size 
> thread pool of 16 threads to run spark queries.
>  * The number of LiveTask objects is 485K.
>  * The AsyncEventQueue instance associated to the AppStatusListener has a 
> value of 854 for dropped events count and a value of 10001 for total events 
> count, knowing that the dropped events counter is reset every minute and that 
> the queue's default capacity is 10000.
> We think that there is a memory leak in Spark UI. Here is our analysis of the 
> root cause of this leak:
>  * AppStatusListener is notified of Spark events using a bounded queue in 
> AsyncEventQueue.
>  * AppStatusListener updates its state (kvstore, liveTasks, liveStages, 
> liveJobs, ...) based on the received events. For example, onTaskStart adds a 
> task to liveTasks map and onTaskEnd removes the task from liveTasks map.
>  * When the rate of events is very high, the bounded queue in AsyncEventQueue 
> is full, some events are dropped and don't make it to AppStatusListener.
>  * Dropped events that signal the end of a processing unit prevent the state 
> of AppStatusListener from being cleaned. For example, a dropped onTaskEnd 
> event, will prevent the task from being removed from liveTasks map, and the 
> task will remain in the heap until the driver's JVM is stopped.
> We were able to confirm our analysis by reducing the capacity of the 
> AsyncEventQueue (spark.scheduler.listenerbus.eventqueue.capacity=10). After 
> having launched many spark queries using this config, we observed that the 
> number of active jobs in Spark UI increased rapidly and remained high even 
> though all submitted queries have completed. We have also noticed that some 
> executor task counters in Spark UI were negative, which confirms that 
> AppStatusListener state does not accurately reflect the reality and that it 
> can be a victim of event drops.
> Suggested fix:
> There are some limits today on the number of "dead" objects in 
> AppStatusListener's maps (for example: spark.ui.retainedJobs). We suggest 
> enforcing another configurable limit on the number of total objects in 
> AppStatusListener's maps and kvstore. This should limit the leak in the case 
> of high events rate, but AppStatusListener stats will remain inaccurate.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] (SPARK-43523) Memory leak in Spark UI

Reply via email to