Nataneljpwd commented on PR #60717:
URL: https://github.com/apache/airflow/pull/60717#issuecomment-3798377300
> Yes. The SparkApplication is the owner of the driver pod. The
> `SparkKubernetesOperator` merely submits the SparkApplication, which then
> handles the lifecycle of the driver pod. I believe the scheduler crash
> mentioned above could occur in a very narrow window between the submission
> of the SparkApplication and the execution of `find_spark_job`. Consider the
> following timeline:
>
> 1. Worker submits SparkApplication
> 2. SparkApplication creates driver pod
> 3. Scheduler crashes (e.g. OOM)
> 4. SparkApplication later creates a second driver pod (with the same label)
> 5. Scheduler restarts
> 6. Worker runs `find_spark_job`
> 7. `find_spark_job` sees two matching pods
How can a SparkApplication create another driver pod?
I do not see how this can happen without the driver pod failing, in which case
we can simply select the pod by status.
> > And isn't there a case where the creation timestamp selects the wrong pod?
> > I.e. the pod from the older retry was stuck in Pending, so the older one
> > ended up with a newer creation timestamp. Can this cause any issues?
>
> I agree. I think the approach needs to be modified so that it prioritizes
> Running pods, with `creation_timestamp` as a tie-breaker. This is a very
> valid edge case that I overlooked.
Have you implemented this? I did not see it in the PR. Overall the PR looks
good and I will review it soon, but I would appreciate it if you could answer
my questions first.
Thank you
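
For concreteness, here is a minimal sketch of the selection logic discussed above (prefer Running pods, then fall back to the newest `creation_timestamp`), assuming the Kubernetes Python client. The helper names and the label selector are illustrative assumptions, not the operator's actual code:

```python
from __future__ import annotations

# Minimal sketch only; names and the label selector below are illustrative
# assumptions, not the actual SparkKubernetesOperator implementation.
from kubernetes.client import CoreV1Api
from kubernetes.client.models import V1Pod


def select_driver_pod(pods: list[V1Pod]) -> V1Pod | None:
    """Prefer Running pods; break ties with the newest creation_timestamp."""
    if not pods:
        return None
    running = [p for p in pods if p.status and p.status.phase == "Running"]
    candidates = running or pods
    # metadata.creation_timestamp is set by the API server when the pod is created.
    return max(candidates, key=lambda p: p.metadata.creation_timestamp)


def find_driver_pod(api: CoreV1Api, namespace: str, app_name: str) -> V1Pod | None:
    # The label selector here is an assumption for illustration; the real
    # operator derives its own selector from the SparkApplication metadata.
    pods = api.list_namespaced_pod(
        namespace=namespace,
        label_selector=f"sparkoperator.k8s.io/app-name={app_name},spark-role=driver",
    ).items
    return select_driver_pod(pods)
```

With something like this, the "two matching pods" case in the timeline resolves to the Running pod, and the timestamp is only consulted when both candidates are in the same phase.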
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]