[ https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478482#comment-17478482 ]
Petri commented on SPARK-37910:
-------------------------------

Also, the error message we get in the executor is pretty vague:

Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.

It raises questions:
* What does "disassociated" mean? Is it related to a disconnection?
* Why must the executor self-exit? Would it be possible to retry the driver association?

It would be good to improve the error message and the related documentation.

> Spark executor self-exiting due to driver disassociated in Kubernetes with client deploy-mode
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37910
>                 URL: https://issues.apache.org/jira/browse/SPARK-37910
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.2.0
>            Reporter: Petri
>            Priority: Major
>
> I have a Spark driver running in a Kubernetes pod with client deploy-mode, and
> it tries to start an executor.
> The executor fails with this error:
> {"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS", "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC", "class":"dispatcher-Executor", "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)", "log":"Executor self-exiting due to : Driver 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting down.\n"}
> The driver then attempts to start another executor, which fails with the same
> error, and this goes on and on.
> In the driver pod, I see only the following errors:
> 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
> 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
> 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
> 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
> 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
> 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
> 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
> 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
> 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
> 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
> 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
> What is wrong? And how can I get executors running correctly?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
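
One way to start narrowing down the question of whether the disassociation is related to a disconnection is to check, from inside an executor pod, whether the driver address printed in the error message is reachable at all. The following is a minimal Python sketch under that assumption; the report does not establish that connectivity is the cause. The function names `parse_driver_address` and `can_connect` are hypothetical helpers introduced here for illustration and are not part of Spark.

```python
import re
import socket

# Matches the executor message quoted in this report:
# "Driver <host>:<port> disassociated"
DRIVER_RE = re.compile(r"Driver (\S+):(\d+) disassociated")


def parse_driver_address(message):
    """Return (host, port) taken from the error message, or None if absent."""
    match = DRIVER_RE.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

Calling `parse_driver_address` on the `log` field of the executor message and passing the result to `can_connect` from an executor pod shows whether the driver's hostname resolves and its port accepts TCP connections. A `False` result would only suggest a networking or DNS problem between the pods; a `True` result does not rule out other causes.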