khogeland commented on issue #26687: [SPARK-30055][k8s] Allow configuration of restart policy for Kubernetes pods
URL: https://github.com/apache/spark/pull/26687#issuecomment-566807128

Responding to this first because I think it's important to frame this change correctly:

> This sounds a bit risky to me and the only advantage I'm seeing here is avoiding the resource allocation step in the k8s server.

- This will allow someone to use the standard Spark distro to deploy a basic workload without relying on manual intervention or 3rd party software for failure handling. That's a big step in the direction of true native support for Kubernetes in Spark (the next step being using the Kubernetes controllers: [SPARK-24122](https://issues.apache.org/jira/browse/SPARK-24122?jql=project%20%3D%20SPARK%20AND%20text%20~%20restartpolicy#)). The current implementation is a great start, but a complicated external scheduler process is still _required_ to run production Spark applications on Kubernetes. (@liyinan926, I think you may find this discussion interesting!)
- The scheduling delay shouldn't be understated. Between scheduling, image pulls, init containers, and JVM/Spark startup, this is in practice often a multi-minute delay in application execution.
- Another advantage is better persistence of cached data across the application. If the driver exits, the executors don't get shut down, so they keep their BlockManager cache (correct me if I'm wrong here, btw). And the executor doesn't lose its filesystem on driver or executor restart.

> So the scary part here is that the driver will try to start more executors on its restart, right?

No, it will discover the executor pods from the previous run before checking how many need to be scheduled. (Although there is a startup race condition that I'll push a fix for: `ExecutorPodsPollingSnapshotSource` needs to be polled once to populate the snapshot store before the allocator is started.)

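To make the ordering concern concrete, here is a rough, simplified model of the restart path in Python. It is not Spark code: `SnapshotStore`, `PollingSource`, `pods_to_request`, and `start_driver` are hypothetical names invented for illustration, and the only assumption taken from the comment above is that the allocator sizes its request from whatever the snapshot store currently holds.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotStore:
    """Hypothetical stand-in for the driver's view of existing executor pods."""
    executor_ids: set = field(default_factory=set)

    def replace(self, ids):
        self.executor_ids = set(ids)


class PollingSource:
    """Hypothetical poller that lists executor pods and fills the store."""

    def __init__(self, list_pods, store):
        self._list_pods = list_pods
        self._store = store

    def poll_once(self):
        self._store.replace(self._list_pods())


def pods_to_request(store, target):
    """How many new executor pods an allocator would ask for (assumed logic)."""
    return max(0, target - len(store.executor_ids))


def start_driver(list_pods, target, poll_before_allocating=True):
    """Model a driver (re)start: optionally poll once, then size the request."""
    store = SnapshotStore()
    source = PollingSource(list_pods, store)
    if poll_before_allocating:
        # Populate the store first so surviving executors are counted.
        source.poll_once()
    return pods_to_request(store, target)
```

In this model, with three executor pods surviving from the previous run and a target of four, `start_driver` requests one new pod when the poll happens first and four when it is skipped, which is the over-allocation the race could cause.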
> What happens when you restart an executor reusing the same pod, meaning it will have the same configuration as before and thus the same executor ID?

This is an excellent question, and I'll dig into this. If reusing the executor ID isn't supported, could it just be randomly generated on startup?
