Github user foxish commented on the issue:
https://github.com/apache/spark/pull/21067
I don't think this current approach will suffice. Correctness is important
here, especially for folks using Spark Streaming. I understand that we're
proposing the use of backoff limits, but there is **no guarantee** that the Job
controller **won't** spin up 2 driver pods when we ask for 1. That is by
definition how the Job controller works: it is greedy and works towards the
desired number of completions (see the sketch after the list below). For
example, in the case of a network partition, the Job controller logic in the
Kubernetes master will not differentiate between:
1. Temporarily losing contact with a driver pod that is still running
2. Finding no driver pod at all, and therefore starting a new one
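To make the concern concrete, here is a rough sketch of submitting the driver as a Job via the fabric8 kubernetes-client builders that Spark's Kubernetes backend already depends on (the job name, image, and namespace are made up, and import paths vary with the client version). The point is that `completions = 1` is only a target the controller greedily works towards, and `backoffLimit` only bounds retries; neither stops the controller from creating a replacement pod while it merely *believes* the first one is gone.

```scala
import io.fabric8.kubernetes.api.model.batch.JobBuilder
import io.fabric8.kubernetes.client.DefaultKubernetesClient

object DriverJobSketch {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    // Illustrative Job: one completion, retries bounded by backoffLimit.
    // During a partition the controller may still run a second driver pod
    // concurrently with the original one.
    val driverJob = new JobBuilder()
      .withNewMetadata().withName("spark-driver-job").endMetadata()  // hypothetical name
      .withNewSpec()
        .withCompletions(1)
        .withParallelism(1)
        .withBackoffLimit(4)
        .withNewTemplate()
          .withNewSpec()
            .withRestartPolicy("Never")
            .addNewContainer()
              .withName("spark-kubernetes-driver")
              .withImage("spark-driver:latest")  // illustrative image
            .endContainer()
          .endSpec()
        .endTemplate()
      .endSpec()
      .build()
    client.batch().jobs().inNamespace("default").create(driverJob)
    client.close()
  }
}
```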
This is why I've proposed using a StatefulSet in the past. However, getting
the right termination semantics with a StatefulSet will be more work. I don't
think we should sacrifice correctness in this layer, as it would surprise
application authors, who would then have to reason about whether the operations
they are performing are idempotent.
Can we have a proposal and understand all the subtleties before trying to
change this behavior? For example, if we end up with more than one driver for a
single job, I'd like to ensure that only one of them is making progress (for
example, by using a lease in ZooKeeper); a rough sketch follows.
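To illustrate the lease idea, here is a minimal sketch using Apache Curator's `LeaderLatch` (the ZooKeeper URL, latch path, and environment variable are purely illustrative, not anything Spark or this PR defines). Every driver pod spawned for the same job competes for the same latch, and only the holder runs the application, so a duplicate pod started by the Job controller stays idle.

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

object DriverLeaseSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical env var; any mechanism for passing the ZK quorum works.
    val zkUrl = sys.env.getOrElse("SPARK_DRIVER_ZK_URL", "zk:2181")
    val client = CuratorFrameworkFactory.newClient(zkUrl, new ExponentialBackoffRetry(1000, 3))
    client.start()

    // One latch path per submitted job, e.g. derived from the job name.
    val latch = new LeaderLatch(client, "/spark/driver-leases/my-job")
    latch.start()
    latch.await()  // blocks until this driver holds the lease

    if (latch.hasLeadership) {
      // ... run the actual driver logic here; a duplicate driver never
      // gets past await() while the first one holds the lease.
    }

    latch.close()
    client.close()
  }
}
```

A second driver created by the Job controller simply parks in `await()`, so even if two pods exist, only one makes progress at a time.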