[GitHub] [spark] khogeland commented on issue #26687: [SPARK-30055][k8s] Allow configuration of restart policy for Kubernetes pods

GitBox Tue, 17 Dec 2019 16:13:39 -0800

khogeland commented on issue #26687: [SPARK-30055][k8s] Allow configuration of
restart policy for Kubernetes pods
URL: https://github.com/apache/spark/pull/26687#issuecomment-566807128

Responding to this first because I think it's important to frame this change
correctly:
> This sounds a bit risky to me and the only advantage I'm seeing here is
> avoiding the resource allocation step in the k8s server.

- This will allow someone to use the standard Spark distro to deploy a basic
workload without relying on manual intervention or third-party software for
failure handling. That's a big step in the direction of true native support for
Kubernetes in Spark (the next step being using the Kubernetes controllers:
[SPARK-24122](https://issues.apache.org/jira/browse/SPARK-24122?jql=project%20%3D%20SPARK%20AND%20text%20~%20restartpolicy#)).
The current implementation is a great start, but a complicated external
scheduler process is still _required_ to run production Spark applications on
Kubernetes. (@liyinan926, I think you may find this discussion interesting!)
- The scheduling delay shouldn't be understated. Between scheduling, image
pulls, init containers, and JVM/Spark startup, rescheduling a pod often delays
application execution by several minutes in practice.
- Another advantage is better persistence of cached data across the
application. If the driver exits, the executors don't get shut down, so they
keep their BlockManager cache (correct me if I'm wrong here, btw). And the
executor doesn't lose its filesystem on driver or executor restart.

> So the scary part here is that the driver will try to start more executors
> on its restart, right?

No, it will discover the executor pods from the previous run before checking
how many need to be scheduled. (There is a startup race condition, though,
that I'll push a fix for: `ExecutorPodsPollingSnapshotSource` needs to be
polled once to populate the snapshot store before the allocator is started.)
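
To make the ordering issue concrete, here is a toy Python model of the race, not Spark's actual code. `SnapshotStore`, `poll_cluster`, and `executors_to_request` are hypothetical names invented for this sketch; it only assumes that the allocator requests the difference between the target executor count and the pods it currently knows about.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class SnapshotStore:
    """Toy stand-in for the driver's view of executor pods (hypothetical)."""
    known_pods: Set[str] = field(default_factory=set)

    def replace_snapshot(self, pods: Set[str]) -> None:
        self.known_pods = set(pods)


def poll_cluster(cluster_pods: Set[str], store: SnapshotStore) -> None:
    """One polling pass: copy the pods that exist in the cluster into the store."""
    store.replace_snapshot(cluster_pods)


def executors_to_request(store: SnapshotStore, target: int) -> int:
    """Allocator step: request only the executors the store does not know about."""
    return max(0, target - len(store.known_pods))


# Three executor pods survived the driver restart; the target is four.
surviving = {"exec-1", "exec-2", "exec-3"}

# Race: the allocator runs against an empty store and over-requests.
racy_request = executors_to_request(SnapshotStore(), target=4)

# Fix: poll once to seed the store, then start the allocator.
seeded = SnapshotStore()
poll_cluster(surviving, seeded)
safe_request = executors_to_request(seeded, target=4)

print(racy_request, safe_request)
```

In this model the unseeded allocator asks for all four executors, while the seeded one asks only for the single missing executor.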

> What happens when you restart an executor reusing the same pod, meaning it
> will have the same configuration as before and thus the same executor ID?

This is an excellent question, and I'll dig into this. If reusing the
executor ID isn't supported, could it just be randomly generated on startup?
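
As a rough illustration of the "randomly generated on startup" idea, in Python rather than Spark's own code: `new_executor_id` is a hypothetical helper, and the sketch assumes that an arbitrary string is acceptable as an executor ID, which is exactly the open question here.

```python
import uuid


def new_executor_id(prefix: str = "exec") -> str:
    """Return a fresh ID on every call, so a restarted container would not
    reuse the ID it had before the restart (hypothetical helper)."""
    return f"{prefix}-{uuid.uuid4().hex[:12]}"


first_start = new_executor_id()
after_restart = new_executor_id()
print(first_start, after_restart)
```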


