SameerMesiah97 commented on PR #60717: URL: https://github.com/apache/airflow/pull/60717#issuecomment-3765325623
> As if an airflow scheduler interruption? As I don't see how this may happen, worker interruption seems more possible, however I don't see when it can happen, let's say for some reason it did exit, the pod is killed yet the driver stays up until it is finished, the task is marked as failed, if we have retries, won't it just create a new spark application and update the try number label? Or if we are in a different dagrun, the runid will be different. I admit this is a rare edge case.

The scenario I am referring to is one where the scheduler gets OOM-killed and this prevents the task state from being updated. On scheduler restart, the task will resume with the same try number and look for an existing driver pod whose labels contain the exact same context variables as before the crash.

> The Spark application as if the CRD? Doesn't it monitor the driver? Or basically deploy and manage the driver? I think I might be missing something, or misunderstanding something.

Yes. The SparkApplication is the owner of the driver pod. The `SparkKubernetesOperator` merely submits the SparkApplication, which then handles the lifecycle of the driver pod. I believe the problem occurs in a very narrow window between the submission of the SparkApplication and the execution of `find_spark_job`, during which the aforementioned scheduler crash could happen. Consider the following timeline:

1. Worker submits the SparkApplication.
2. The SparkApplication creates a driver pod.
3. The scheduler crashes (e.g. OOM).
4. The SparkApplication later creates a second driver pod (with the same labels).
5. The scheduler restarts.
6. The worker runs `find_spark_job`.
7. `find_spark_job` sees two matching pods.

> What if I do want to fail the task? As if I am truncating a table which may not be an atomic operation, and I use airflow expecting there will be no more than 1 run, and so I want to be notified on failure.

I believe this point is somewhat tangential to deterministic pod selection. This is not about multiple task runs but about failsafe pod recovery in a very specific scenario. And if you do want to fail the task under certain conditions, using pod discovery as a trigger for task success or failure is brittle, since there can be multiple pods per task. It would be better to use something more explicit, such as an exit code or the return value of the operation. In your table-truncation example, that would be the failure/success event fired by the DB.

Also, under this new implementation, users will still be notified when there are multiple pods; there just won't be a hard failure like before. I do admit that I may have downplayed this in the PR description. It is a significant behavior change, as multiple pods now lead to soft recovery instead of hard failure. But that is because the situation does not indicate anything critically wrong within Airflow; it stems from the idiosyncrasies of how a SparkApplication handles pod reconciliation and creation.

> Is there a way to reproduce the issue stated above?

It's possible, but very difficult to do deterministically. If you look at the existing condition, it appears to imply that multiple pods with the same label being picked up is a plausible scenario. This is what initially motivated me to open this PR.
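To make the lookup concrete, here is a minimal sketch of the kind of label-based pod listing involved (not the operator's actual code; the namespace and label keys below are made up for illustration):

```python
from kubernetes import client, config

# Sketch of a label-based driver-pod lookup. The label keys (dag_id, task_id,
# run_id, try_number) and the namespace are illustrative assumptions, not the
# exact labels used by SparkKubernetesOperator.
config.load_kube_config()
v1 = client.CoreV1Api()

label_selector = ",".join(
    [
        "spark-role=driver",
        "dag_id=my_dag",
        "task_id=my_task",
        "run_id=manual__2024-01-01",
        "try_number=1",
    ]
)

pods = v1.list_namespaced_pod("spark-jobs", label_selector=label_selector).items

# In the crash/restart timeline above, this list can contain two driver pods
# that both match the same task context, so a hard failure on len(pods) > 1
# does not necessarily indicate anything wrong inside Airflow.
print([p.metadata.name for p in pods])
```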
> And isn't there a case where the creation timestamp is the wrong pod? I.e the older retry pod was stuck in pending, so the older one got a newer created at timestamp, can this cause any issues?

I agree. I think the approach needs to be modified so that it prioritizes running pods, with `creation_timestamp` serving only as a tie-breaker. This is a very valid edge case that I overlooked.
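Concretely, the revised selection I have in mind would look roughly like this (a sketch only; the helper name and the newest-first tie-break direction are assumptions, not the final implementation):

```python
from typing import Optional

from kubernetes.client.models import V1Pod


def select_driver_pod(candidates: list[V1Pod]) -> Optional[V1Pod]:
    """Pick a single driver pod deterministically.

    Running pods are preferred over non-running ones; creation_timestamp is
    used purely as a tie-breaker (newest first is an assumption here).
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda pod: (
            pod.status.phase == "Running",       # running pods win outright
            pod.metadata.creation_timestamp,     # tie-break by creation time
        ),
    )
```

Preferring the `Running` phase handles the stuck-in-pending retry pod, and `creation_timestamp` only matters when both candidates are in the same phase.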
