Nataneljpwd commented on PR #60717:
URL: https://github.com/apache/airflow/pull/60717#issuecomment-3798377300

   > Yes. The SparkApplication is the owner of the driver pod. The `SparkKubernetesOperator` merely submits the SparkApplication, which then handles the lifecycle of the driver pod. I believe this would occur in a very narrow window between the submission of the SparkApplication and the execution of `find_spark_job`, during which the aforementioned scheduler crash could occur. Consider the following timeline:
   > 
   > 1. Worker submits SparkApplication
   > 2. SparkApplication creates driver pod
   > 3. Scheduler crashes (e.g. OOM)
   > 4. SparkApplication later creates a second driver pod (with the same label)
   > 5. Scheduler restarts
   > 6. Worker runs `find_spark_job`
   > 7. `find_spark_job` sees two matching pods 
   
   How can a SparkApplication create another driver pod?
   I do not see how that happens without the first driver pod failing, in which case we can simply select the pod by its status.
   
   > > And isn't there a case where the creation timestamp points to the wrong pod? I.e. the older retry pod was stuck in Pending, so the older one got a newer creation timestamp; can this cause any issues?
   > 
   > I agree. I think the approach needs to be modified so that it prioritizes running pods, with `creation_timestamp` being a tie-breaker. This is a very valid edge case that I overlooked.
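   
   For concreteness, something along the lines of the following sketch is what I would expect (a rough illustration only, assuming the candidate pods have already been listed via the label selector; the helper name is hypothetical and not part of the provider):
   
   ```python
   from __future__ import annotations
   
   from kubernetes.client import V1Pod
   
   
   def select_driver_pod(pods: list[V1Pod]) -> V1Pod | None:
       """Pick the most likely current driver pod from the label-selector matches.
   
       Running pods are preferred over pods in any other phase; within the same
       group, the newest creation_timestamp is used as the tie-breaker.
       """
       if not pods:
           return None
       return max(
           pods,
           key=lambda pod: (
               pod.status.phase == "Running",    # prefer running pods
               pod.metadata.creation_timestamp,  # newest pod wins ties
           ),
       )
   ```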
   
   Have you implemented this? I did not see it. Overall, the PR looks good; I will take a look at it soon, though I would appreciate it if you could answer my questions.
   Thank you
   

