khogeland commented on issue #26687: [SPARK-30055][k8s] Allow configuration of 
restart policy for Kubernetes pods
URL: https://github.com/apache/spark/pull/26687#issuecomment-566807128
 
 
   Responding to this first because I think it's important to frame this change 
correctly:
   >This sounds a bit risky to me and the only advantage I'm seeing here is 
avoiding the resource allocation step in the k8s server.
   
   - This will allow someone to use the standard Spark distro to deploy a basic workload without relying on manual intervention or third-party software for failure handling. That's a big step in the direction of true native support for Kubernetes in Spark (the next step being using the Kubernetes controllers: [SPARK-24122](https://issues.apache.org/jira/browse/SPARK-24122?jql=project%20%3D%20SPARK%20AND%20text%20~%20restartpolicy#)). The current implementation is a great start, but a complicated external scheduler process is still _required_ to run production Spark applications on Kubernetes. (@liyinan926, I think you may find this discussion interesting!)
   - The scheduling delay shouldn't be underestimated. Between scheduling, image pulls, init containers, and JVM/Spark startup, this often adds several minutes to application execution in practice.
   - Another advantage is better persistence of cached data across restarts. If the driver exits, the executors don't get shut down, so they keep their BlockManager cache (correct me if I'm wrong here, btw), and an executor doesn't lose its local filesystem on driver or executor restart.
   
   >So the scary part here is that the driver will try to start more executors 
on its restart, right? 
   
   No, it will discover the executor pods from the previous run before checking how many need to be scheduled. (Although there is a startup race condition that I'll push a fix for: `ExecutorPodsPollingSnapshotSource` needs to be polled once to populate the snapshot store before the allocator is started.)
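   To make the ordering concrete, here's a minimal sketch of the fix I have in mind. The class names mirror Spark's k8s internals (`ExecutorPodsPollingSnapshotSource`, the snapshot store, the allocator), but the bodies below are simplified stand-ins, not the real implementations:

   ```scala
   import scala.collection.mutable

   // Stand-in for the snapshot store: holds the latest known set of executor pods.
   class SnapshotStore {
     val knownExecutorIds = mutable.Set.empty[Long]
     def replaceSnapshot(ids: Seq[Long]): Unit = {
       knownExecutorIds.clear()
       knownExecutorIds ++= ids
     }
   }

   // Stand-in for ExecutorPodsPollingSnapshotSource: lists pods from the API server.
   class PollingSnapshotSource(store: SnapshotStore, listPods: () => Seq[Long]) {
     def pollOnce(): Unit = store.replaceSnapshot(listPods())
   }

   // Stand-in for the allocator: requests only the shortfall against the target.
   class Allocator(store: SnapshotStore, target: Int) {
     def executorsToRequest: Int = math.max(0, target - store.knownExecutorIds.size)
   }

   object StartupOrder {
     def main(args: Array[String]): Unit = {
       val store  = new SnapshotStore
       // Pretend two executor pods from the previous driver run are still alive.
       val source = new PollingSnapshotSource(store, () => Seq(1L, 2L))
       val alloc  = new Allocator(store, target = 3)

       // The fix: populate the snapshot store once BEFORE consulting the
       // allocator, so surviving pods are counted and not re-created.
       source.pollOnce()
       println(alloc.executorsToRequest) // prints 1: only the shortfall
     }
   }
   ```

   Without the `pollOnce()` before the allocator runs, the store is empty and the restarted driver would request a full set of new executors on top of the surviving pods.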
   
   > What happens when you restart an executor reusing the same pod, meaning it 
will have the same configuration as before and thus the same executor ID?
   
   This is an excellent question, and I'll dig into it. If reusing the executor ID isn't supported, could it just be randomly generated on startup?
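   Something along these lines is what I'm picturing — a hypothetical sketch, not Spark's actual ID scheme: suffix the ID baked into the restarted pod's config with a fresh random component so the executor registers under a distinct identity.

   ```scala
   import java.util.UUID

   object ExecutorIdGen {
     // Hypothetical helper: derive a collision-resistant ID from the configured
     // (possibly reused) executor ID by appending a random suffix at startup.
     def freshId(configuredId: String): String =
       s"$configuredId-${UUID.randomUUID().toString.take(8)}"

     def main(args: Array[String]): Unit = {
       val a = freshId("7")
       val b = freshId("7")
       // Two restarts of the "same" pod yield distinct registration IDs.
       println(a != b) // prints true
     }
   }
   ```

   The open question would be whether anything on the driver side assumes executor IDs are monotonically increasing integers.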
