[
https://issues.apache.org/jira/browse/SPARK-37910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-37910:
----------------------------------
Priority: Major (was: Blocker)
> Spark executor self-exiting due to driver disassociated in Kubernetes with
> client deploy-mode
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-37910
> URL: https://issues.apache.org/jira/browse/SPARK-37910
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.0
> Reporter: Petri
> Priority: Major
>
> I have a Spark driver running in a Kubernetes pod in client deploy mode,
> and it tries to start an executor.
> The executor fails with this error:
> {"type":"log", "level":"ERROR", "name":"STREAMING_OTHERS",
> "time":"2022-01-14T12:29:38.318Z", "timezone":"UTC",
> "class":"dispatcher-Executor",
> "method":"spark.executor.CoarseGrainedExecutorBackend.logError(73)",
> "log":"Executor self-exiting due to : Driver
> 192-168-39-71.mni-system.pod.cluster.local:40752 disassociated! Shutting
> down.\n"}
> The driver then attempts to start another executor, which fails with the
> same error, and this repeats indefinitely.
> In the driver pod, I see only the following errors:
> 22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
> 22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
> 22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
> 22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
> 22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
> 22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
> 22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
> 22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
> 22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
> 22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
> 22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
> What is wrong? And how can I get executors running correctly?
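
The driver log quoted above shows executors being lost at a fairly steady
cadence. As a rough aid for triage, the sketch below parses those lines and
computes the seconds between consecutive losses. It assumes each log entry is
on a single line; the names LOG, PATTERN, and lost_executor_intervals are
illustrative only and are not part of Spark. It does not diagnose the cause.

```python
import re
from datetime import datetime

# Driver log lines copied from the report, one entry per line.
LOG = """\
22/01/14 12:26:32 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.43.250:
22/01/14 12:27:16 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.43.233:
22/01/14 12:27:59 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.43.221:
22/01/14 12:28:43 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.43.217:
22/01/14 12:29:27 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.43.197:
22/01/14 12:30:10 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.43.237:
22/01/14 12:30:53 ERROR TaskSchedulerImpl: Lost executor 7 on 192.168.43.196:
22/01/14 12:31:42 ERROR TaskSchedulerImpl: Lost executor 8 on 192.168.43.228:
22/01/14 12:32:31 ERROR TaskSchedulerImpl: Lost executor 9 on 192.168.43.254:
22/01/14 12:33:14 ERROR TaskSchedulerImpl: Lost executor 10 on 192.168.43.204:
22/01/14 12:33:57 ERROR TaskSchedulerImpl: Lost executor 11 on 192.168.43.231:
"""

# Hypothetical pattern matching only the "Lost executor" lines shown above.
PATTERN = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ERROR TaskSchedulerImpl: "
    r"Lost executor (\d+) on (\S+?):"
)


def lost_executor_intervals(log_text):
    """Return (executor_id, seconds_since_previous_loss) for each loss after the first."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            # Timestamps in the report use the yy/MM/dd HH:mm:ss layout.
            timestamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            events.append((int(match.group(2)), timestamp))
    return [
        (curr_id, int((curr_ts - prev_ts).total_seconds()))
        for (_, prev_ts), (curr_id, curr_ts) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for executor_id, gap in lost_executor_intervals(LOG):
        print(f"executor {executor_id}: lost {gap}s after the previous one")
```

On the quoted log, every gap falls between 43 and 49 seconds, which suggests
each executor fails at about the same point after it is requested rather than
at random.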