[ 
https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478482#comment-17478482
 ] 

Petri commented on SPARK-37910:
-------------------------------

Also, the error message we get in the executor is pretty vague:

 Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.

It raises some questions:
 * What does "disassociated" mean? Is it related to a network disconnection, or 
is it something else?
 * Why must the executor self-exit? Would it be possible to retry the driver 
association?

It would be good to improve the error message and related documentation.
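
To check whether the disassociation is a plain network disconnection, one option is to test from inside an executor pod whether the driver address in the message accepts TCP connections. The following is only a sketch, assuming Python is available in the executor image; parse_driver_address and can_connect are hypothetical helper names, not part of Spark.

```python
import re
import socket

# Pattern based on the executor message quoted above:
# "Driver <host>:<port> disassociated! Shutting down."
MESSAGE_PATTERN = re.compile(r"Driver\s+(?P<host>\S+):(?P<port>\d+)\s+disassociated")


def parse_driver_address(message):
    """Return (host, port) from a 'Driver host:port disassociated' message, or None."""
    match = MESSAGE_PATTERN.search(message)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    line = (
        "Executor self-exiting due to : Driver "
        "192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down."
    )
    address = parse_driver_address(line)
    if address is not None:
        host, port = address
        status = "reachable" if can_connect(host, port) else "unreachable"
        print(f"{host}:{port} is {status}")
```

A failed connection would point to name resolution or network reachability between the executor pod and the driver pod. A successful one only shows that the port accepts connections, not that the Spark RPC layer is healthy.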

> Spark executor self-exiting due to driver disassociated in Kubernetes with 
> client deploy-mode
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37910
>                 URL: https://issues.apache.org/jira/browse/SPARK-37910
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Petri
>            Priority: Major
>
> I have a Spark driver running in a Kubernetes pod in client deploy-mode, and 
> it tries to start an executor.
> The executor fails with this error:
>     {"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", "class":"dispatcher-Executor", "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", "log":"Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.\n"}
> The driver then attempts to start another executor, which fails with the same 
> error, and this cycle keeps repeating.
> In the driver pod, I see only the following errors:
>     22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
>     22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
>     22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
>     22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
>     22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
>     22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
>     22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
>     22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
>     22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
>     22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
>     22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
> What is wrong? And how can I get executors running correctly?


