[ https://issues.apache.org/jira/browse/SPARK-32795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Victor Tso updated SPARK-32795:
-------------------------------
    Description: 
!image-2020-09-03-23-27-11-809.png!

In my case, the Standalone Spark master process had a max heap of 1 GB. 738 MB
of it was consumed by ExecutorDesc objects (see the attached heap-dump
screenshot), the vast majority of them the 18.5 million entries in
removedExecutors. This caused the master to OOM and left the application
driver process dangling.
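
For context, here is a simplified sketch of the accumulation pattern as I understand it. The names mirror ApplicationInfo and ExecutorDesc in Spark Core, but this is an illustration of the behavior, not the actual source:

{code:scala}
import scala.collection.mutable.{ArrayBuffer, HashMap}

// Simplified stand-in for the real ExecutorDesc held by the master.
case class ExecutorDesc(id: Int, cores: Int, memoryMb: Int)

class ApplicationInfo {
  val executors = new HashMap[Int, ExecutorDesc]()
  // Every removed executor is appended here and never evicted, so a
  // worker stuck in a launch/crash loop makes this buffer grow without
  // bound until the master heap fills up.
  val removedExecutors = new ArrayBuffer[ExecutorDesc]()

  def removeExecutor(exec: ExecutorDesc): Unit = {
    if (executors.contains(exec.id)) {
      removedExecutors += executors(exec.id)
      executors -= exec.id
    }
  }
}
{code}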

The trigger was a worker node running out of disk space: the worker, for
whatever reason, went into a fast and endless loop of launching new executors,
each of which crashed in turn. The removed-executor history reached roughly
18 million entries before the master simply couldn't hold it anymore.
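
One possible mitigation is to cap the retained history, the same way the master already bounds completed applications and drivers via spark.deploy.retainedApplications and spark.deploy.retainedDrivers. A minimal sketch of such a bound (the cap value and any config name for it are assumptions, not existing settings):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Append-only history with a retention limit: once the cap is exceeded,
// the oldest entries are evicted so memory use stays bounded.
class BoundedHistory[T](retained: Int) {
  private val buffer = new ArrayBuffer[T]()

  def append(item: T): Unit = {
    buffer += item
    if (buffer.size > retained) {
      buffer.remove(0, buffer.size - retained)
    }
  }

  def toSeq: Seq[T] = buffer.toSeq
}

// Hypothetical usage inside ApplicationInfo: keep only the most recent
// removed executors instead of all of them.
// val removedExecutors = new BoundedHistory[ExecutorDesc](retained = 1000)
{code}

With a bound like this in place, the crash loop described above would retain only the newest few entries instead of 18M of them.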



> ApplicationInfo#removedExecutors can cause OOM
> ----------------------------------------------
>
>                 Key: SPARK-32795
>                 URL: https://issues.apache.org/jira/browse/SPARK-32795
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Victor Tso
>            Priority: Critical
>         Attachments: image-2020-09-03-23-27-11-809.png
>


