khogeland commented on issue #26687: [SPARK-30055][k8s] Allow configuration of restart policy for Kubernetes pods
URL: https://github.com/apache/spark/pull/26687#issuecomment-566313232

> We would want to make sure we're not leaking executor pods.

Executors are identified by a label containing the application ID ([here](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L40)), which SparkSubmit puts in the driver's Spark configuration via a ConfigMap ([here](https://github.com/apache/spark/blob/946aef05351a9db4c5a352992bc5556a6914ea6f/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L158) and [here](https://github.com/apache/spark/blob/02c5b4f76337cc3901b8741887292bb4478931f3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L100)). If the driver process restarts, it will have the same app ID and will be able to locate the executors from the previous run.

> The executor pods may not be able to contact the driver however and might choose to exit accordingly.

It looks like the default timeout for that is 10 minutes of failure to heartbeat to the driver: https://github.com/apache/spark/blob/ad238a2238a9d0da89be4424574436cbfaee579d/core/src/main/scala/org/apache/spark/internal/config/package.scala#L210-L216

It obviously shouldn't take that long to restart the driver, and in any case, the retry is effectively infinite if the executor is configured to restart when it exits.

The driver address also doesn't change when the driver process is restarted. It is, however, for some reason a k8s service, meaning the route will be broken while the driver pod is unready. So, in the worst case, the executor hits its 10-minute timeout before reconnecting successfully.
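As a rough illustration of the timeout arithmetic above, here is a minimal Python sketch. It assumes the 10-minute figure comes from a 10-second heartbeat interval multiplied by 60 allowed consecutive failures. Those two values, the config names in the comments, and both helper functions are assumptions made for illustration; none of them is taken from the linked source.

```python
from datetime import timedelta

# Assumed defaults (not stated in the comment itself):
HEARTBEAT_INTERVAL = timedelta(seconds=10)  # assumed spark.executor.heartbeatInterval
MAX_HEARTBEAT_FAILURES = 60                 # assumed spark.executor.heartbeat.maxFailures


def executor_give_up_after(interval: timedelta, max_failures: int) -> timedelta:
    """How long an executor keeps failing heartbeats before it exits."""
    return interval * max_failures


def survives_driver_restart(
    driver_downtime: timedelta,
    interval: timedelta = HEARTBEAT_INTERVAL,
    max_failures: int = MAX_HEARTBEAT_FAILURES,
) -> bool:
    """True if the driver is reachable again before the executor gives up."""
    return driver_downtime < executor_give_up_after(interval, max_failures)


if __name__ == "__main__":
    print(executor_give_up_after(HEARTBEAT_INTERVAL, MAX_HEARTBEAT_FAILURES))
    print(survives_driver_restart(timedelta(seconds=90)))
```

Under these assumptions, a driver restart that takes a minute or two stays well inside the window, and an executor that does give up would simply be restarted if its pod restart policy allows it.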
