khogeland edited a comment on issue #26687: [SPARK-30055][k8s] Allow 
configuration of restart policy for Kubernetes pods
URL: https://github.com/apache/spark/pull/26687#issuecomment-566313232
 
 
   >If the driver pod restarts, what happens to the executor pods that were 
running in the previous execution of the application? We would want to make 
sure we're not leaking executor pods.
   
   The driver identifies its executors using a label containing the application 
ID 
([here](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L40)).
 SparkSubmit writes the ID into the driver's Spark configuration via a 
ConfigMap 
([here](https://github.com/apache/spark/blob/946aef05351a9db4c5a352992bc5556a6914ea6f/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L158)
 and 
[here](https://github.com/apache/spark/blob/02c5b4f76337cc3901b8741887292bb4478931f3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L100)).
 If the driver process restarts, it will have the same app ID and will be able 
to locate the executors from the previous run.
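
As a rough illustration of the label-based lookup described above, here is a minimal in-memory sketch. The label keys `spark-app-selector` and `spark-role` and the role value `executor` are assumptions based on Spark's Kubernetes constants; the pod dictionaries and helper names are hypothetical, and no Kubernetes API is called.

```python
# Assumed label keys/values; check them against the Spark version in use.
SPARK_APP_ID_LABEL = "spark-app-selector"
SPARK_ROLE_LABEL = "spark-role"
SPARK_POD_EXECUTOR_ROLE = "executor"


def executor_selector(app_id):
    """Build the label set that identifies executors of one application."""
    return {
        SPARK_APP_ID_LABEL: app_id,
        SPARK_ROLE_LABEL: SPARK_POD_EXECUTOR_ROLE,
    }


def find_executors(pods, app_id):
    """Return names of pods whose labels match every entry in the selector."""
    wanted = executor_selector(app_id)
    return [
        pod["name"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in wanted.items())
    ]


# Hypothetical pods left over from a previous run of the same application.
pods = [
    {"name": "exec-1", "labels": {"spark-app-selector": "spark-abc", "spark-role": "executor"}},
    {"name": "driver", "labels": {"spark-app-selector": "spark-abc", "spark-role": "driver"}},
    {"name": "exec-9", "labels": {"spark-app-selector": "spark-xyz", "spark-role": "executor"}},
]
matched = find_executors(pods, "spark-abc")
```

Because the selector depends only on the application ID, a restarted driver that reads the same ID from its configuration would match the same executor pods.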
   
   >The executor pods may not be able to contact the driver however and might 
choose to exit accordingly.
   
   It looks like the default timeout for that is 10 minutes of failure to 
heartbeat to the driver: 
https://github.com/apache/spark/blob/ad238a2238a9d0da89be4424574436cbfaee579d/core/src/main/scala/org/apache/spark/internal/config/package.scala#L210-L216
   It obviously shouldn't take that long to restart the driver, and in any 
case, the retry is effectively infinite if the executor is configured to 
restart when it exits. The driver address also doesn't change when the driver 
process is restarted. However, that address is, for some reason, a k8s service, 
so the route will be broken while the driver pod is unready. So, in the worst 
case, the executor hits its 10 minute timeout before reconnecting successfully.
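
The 10 minute figure can be checked with simple arithmetic. This sketch assumes the linked lines define the heartbeat failure limit and that the defaults are a 10 second `spark.executor.heartbeatInterval` and 60 for `spark.executor.heartbeat.maxFailures`; both values are assumptions to verify against the Spark version in use, and the function name is hypothetical.

```python
def heartbeat_failure_window_seconds(interval_s=10, max_failures=60):
    """Approximate time an executor keeps retrying heartbeats before exiting.

    interval_s:   assumed default of spark.executor.heartbeatInterval (seconds)
    max_failures: assumed default of spark.executor.heartbeat.maxFailures
    """
    return interval_s * max_failures


window_s = heartbeat_failure_window_seconds()
window_min = window_s / 60
```

Under these assumptions the window is 600 seconds, so a driver restart that completes well within that time should not cause executors to give up.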
