Github user foxish commented on the issue:
https://github.com/apache/spark/pull/21067
I don't think this current approach will suffice. Correctness is important
here, especially for folks using Spark Streaming. I understand that we're
proposing the use of backoff limits, but there is **no guarantee** that the Job
controller **won't** spin up 2 driver pods when we ask for 1. That is by
definition how the Job controller works: it is greedy and works towards the
desired number of completions (see the sketch after the list below). For
example, in the case of a network partition, the Job controller logic in the
Kubernetes master will not differentiate between:
1. Temporarily losing contact with a driver pod that is still running
2. Finding no driver pod at all, and therefore starting a new one
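To make the concern concrete, here is a rough sketch of submitting the driver as a Job via the fabric8 kubernetes-client builders that Spark's Kubernetes backend already depends on (the job name, image, and namespace are made up, and import paths vary with the client version). The point is that `completions = 1` is only a target the controller greedily works towards, and `backoffLimit` only bounds retries; neither stops the controller from creating a replacement pod while it merely *believes* the first one is gone.

```scala
import io.fabric8.kubernetes.api.model.batch.JobBuilder
import io.fabric8.kubernetes.client.DefaultKubernetesClient

object DriverJobSketch {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()
    // Illustrative Job: one completion, retries bounded by backoffLimit.
    // During a partition the controller may still run a second driver pod
    // concurrently with the original one.
    val driverJob = new JobBuilder()
      .withNewMetadata().withName("spark-driver-job").endMetadata()  // hypothetical name
      .withNewSpec()
        .withCompletions(1)
        .withParallelism(1)
        .withBackoffLimit(4)
        .withNewTemplate()
          .withNewSpec()
            .withRestartPolicy("Never")
            .addNewContainer()
              .withName("spark-kubernetes-driver")
              .withImage("spark-driver:latest")  // illustrative image
            .endContainer()
          .endSpec()
        .endTemplate()
      .endSpec()
      .build()
    client.batch().jobs().inNamespace("default").create(driverJob)
    client.close()
  }
}
```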
This is why I've proposed using a StatefulSet in the past. However, getting
the right termination semantics with a StatefulSet will be more work. I don't
think we should sacrifice correctness in this layer, as it would surprise
application authors, who would then have to reason about whether the operations
they are performing are idempotent.
Can we have a proposal and understand all the subtleties before trying to
change this behavior? For example, if we end up with more than one driver for a
single job, I'd like to ensure that only one of them is making progress (for
example, by using a lease in ZooKeeper); a rough sketch follows.
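To illustrate the lease idea, here is a minimal sketch using Apache Curator's `LeaderLatch` (the ZooKeeper URL, latch path, and environment variable are purely illustrative, not anything Spark or this PR defines). Every driver pod spawned for the same job competes for the same latch, and only the holder runs the application, so a duplicate pod started by the Job controller stays idle.

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.leader.LeaderLatch
import org.apache.curator.retry.ExponentialBackoffRetry

object DriverLeaseSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical env var; any mechanism for passing the ZK quorum works.
    val zkUrl = sys.env.getOrElse("SPARK_DRIVER_ZK_URL", "zk:2181")
    val client = CuratorFrameworkFactory.newClient(zkUrl, new ExponentialBackoffRetry(1000, 3))
    client.start()

    // One latch path per submitted job, e.g. derived from the job name.
    val latch = new LeaderLatch(client, "/spark/driver-leases/my-job")
    latch.start()
    latch.await()  // blocks until this driver holds the lease

    if (latch.hasLeadership) {
      // ... run the actual driver logic here; a duplicate driver never
      // gets past await() while the first one holds the lease.
    }

    latch.close()
    client.close()
  }
}
```

A second driver created by the Job controller simply parks in `await()`, so even if two pods exist, only one makes progress at a time.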