khogeland commented on issue #26687: [SPARK-30055][k8s] Allow configuration of 
restart policy for Kubernetes pods
URL: https://github.com/apache/spark/pull/26687#issuecomment-566313232
 
 
   >We would want to make sure we're not leaking executor pods.
   
   Executors are identified by a label containing the application ID 
([here](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala#L40)),
 which is put by SparkSubmit in the driver's Spark configuration via a 
ConfigMap 
([here](https://github.com/apache/spark/blob/946aef05351a9db4c5a352992bc5556a6914ea6f/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L158)
 and 
[here](https://github.com/apache/spark/blob/02c5b4f76337cc3901b8741887292bb4478931f3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L100)).
 If the driver process restarts, it will have the same app ID and will be able 
to locate the executors from the previous run.
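
As a rough illustration of the label-based lookup described above, here is a minimal sketch that builds the Kubernetes label selector a restarted driver could use to find its executors. The label keys are the ones Spark on Kubernetes is understood to use (`spark-app-selector`, `spark-role`); the helper name and the application ID are hypothetical and are not taken from the linked code.

```python
# Label keys assumed to match Spark's Kubernetes constants; the helper
# function and the app ID used below are hypothetical.
SPARK_APP_ID_LABEL = "spark-app-selector"
SPARK_ROLE_LABEL = "spark-role"
SPARK_POD_EXECUTOR_ROLE = "executor"


def executor_label_selector(app_id: str) -> str:
    """Build a label selector that matches the executor pods of one application."""
    labels = {
        SPARK_APP_ID_LABEL: app_id,
        SPARK_ROLE_LABEL: SPARK_POD_EXECUTOR_ROLE,
    }
    # Kubernetes equality-based selectors are comma-separated key=value pairs.
    return ",".join(f"{key}={value}" for key, value in labels.items())


# The app ID stays the same across driver restarts, so the selector does too.
print(executor_label_selector("spark-application-1234"))
```

Because the selector depends only on the app ID, a driver that restarts with the same configuration produces the same selector and therefore sees the same set of executor pods.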
   
   >The executor pods may not be able to contact the driver however and might 
choose to exit accordingly.
   
   It looks like the default timeout for that is 10 minutes of failure to 
heartbeat to the driver: 
https://github.com/apache/spark/blob/ad238a2238a9d0da89be4424574436cbfaee579d/core/src/main/scala/org/apache/spark/internal/config/package.scala#L210-L216
   It obviously shouldn't take that long to restart the driver, and in any 
case, the retry is effectively infinite if the executor is configured to 
restart when it exits. The driver address also doesn't change when the driver 
process is restarted, although for some reason it is a k8s service, meaning the 
route will be broken while the driver pod is unready. So, in the worst case, 
the executor hits its 10 minute timeout before reconnecting successfully.
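
For reference, the 10 minute figure can be reproduced as the heartbeat interval multiplied by the number of tolerated failures. This is a sketch that assumes the defaults are a 10 second `spark.executor.heartbeatInterval` and 60 for `spark.executor.heartbeat.maxFailures`; both values are assumptions and should be checked against the linked config lines. The function name is hypothetical.

```python
from datetime import timedelta

# Assumed defaults; verify against the linked package.scala lines.
HEARTBEAT_INTERVAL = timedelta(seconds=10)  # spark.executor.heartbeatInterval
MAX_FAILURES = 60                           # spark.executor.heartbeat.maxFailures


def heartbeat_give_up_after(interval: timedelta, max_failures: int) -> timedelta:
    """Time an executor keeps retrying heartbeats before it gives up and exits."""
    return interval * max_failures


print(heartbeat_give_up_after(HEARTBEAT_INTERVAL, MAX_FAILURES))  # 0:10:00
```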
