Re: [PR] Make Spark driver reattachment deterministic when multiple pods match [airflow]

via GitHub Sun, 18 Jan 2026 03:34:52 -0800


Nataneljpwd commented on PR #60717:
URL: https://github.com/apache/airflow/pull/60717#issuecomment-3765196785


   > Although Spark driver pods are normally unique for a given task execution, 
multiple driver pods with identical labels can occur when the same task attempt 
is started again after an abrupt scheduler or worker interruption.
   
   Do you mean an Airflow scheduler interruption? I don't see how that could 
cause this. A worker interruption seems more plausible, but I still don't see 
when it would happen. Say the worker did exit for some reason: the worker pod 
is killed, yet the driver stays up until it finishes, and the task is marked as 
failed. If we have retries, won't the retry just create a new Spark application 
and update the try number label?
   And if we are in a different DAG run, the run ID will be different.
   
   > In these cases, the original Spark driver pod is not cleaned up, and the 
SparkApplication may spawn a new driver pod that reuses the same labels. This 
results in multiple matching pods representing the same logical execution.
   
   By SparkApplication, do you mean the CRD?
   Doesn't it monitor the driver, or rather deploy and manage it? I think I 
might be missing or misunderstanding something.
   
   > Failing fast in this scenario causes unnecessary task failures despite a 
recoverable state. Selecting a deterministic “most recent” pod allows Airflow 
to reattach to the correct driver while still surfacing the unexpected 
condition via warning logs.
   
   What if I do want the task to fail? For example, if I am truncating a table, 
which may not be an atomic operation, and I use Airflow expecting no more than 
one run, then I want to be notified on failure.
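
   To make the request concrete, here is a minimal sketch of an opt-in strict 
mode. Everything in it is hypothetical: `select_driver_pod`, the 
`fail_on_multiple` flag, and `MultipleDriverPodsError` are illustrative names 
and not part of the provider, and the input is assumed to be a list of matching 
pod names ordered oldest to newest.

```python
import logging

log = logging.getLogger(__name__)


class MultipleDriverPodsError(RuntimeError):
    """Hypothetical error raised when more than one driver pod matches."""


def select_driver_pod(pod_names, fail_on_multiple=True):
    """Pick a driver pod name from the matches.

    pod_names is assumed to be ordered oldest to newest (an assumption of
    this sketch, not a guarantee of any Kubernetes API).
    """
    if not pod_names:
        raise LookupError("no matching driver pod found")
    if len(pod_names) > 1:
        if fail_on_multiple:
            # Strict mode: fail the task so the user is notified.
            raise MultipleDriverPodsError(
                f"expected 1 driver pod, found {len(pod_names)}: {pod_names}"
            )
        # Lenient mode: warn and fall through to the newest pod.
        log.warning("multiple driver pods matched: %s", pod_names)
    return pod_names[-1]
```

   With something like this, users running non-idempotent tasks could keep the 
fail-fast behaviour, while others opt into reattaching to the newest pod.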
   
   Is there a way to reproduce the issue described above?
   And isn't there a case where the newest creation timestamp belongs to the 
wrong pod? E.g. the pod from the older retry was stuck in Pending, so the older 
one ended up with a newer creation timestamp. Can this cause any issues?
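
   The case I have in mind can be sketched as follows. This is only an 
illustration under assumptions: `PodInfo` is a hypothetical stand-in for a 
Kubernetes pod object, the pod names and timestamps are made up, and it 
assumes the try number is available per pod (for example via a label).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PodInfo:
    # Hypothetical stand-in for a pod; not the provider's real type.
    name: str
    creation_timestamp: datetime
    phase: str
    try_number: int


def pick_by_creation_timestamp(pods):
    """Select the pod with the latest creation timestamp."""
    return max(pods, key=lambda p: p.creation_timestamp)


def pick_by_try_number(pods):
    """Select by try number first, using the timestamp only to break ties."""
    return max(pods, key=lambda p: (p.try_number, p.creation_timestamp))


# Made-up scenario: the pod for the older try was created later and is
# still Pending, while the pod for the newer try is already Running.
pods = [
    PodInfo("driver-try2", datetime(2026, 1, 18, 10, 0, tzinfo=timezone.utc), "Running", 2),
    PodInfo("driver-try1", datetime(2026, 1, 18, 10, 5, tzinfo=timezone.utc), "Pending", 1),
]

by_timestamp = pick_by_creation_timestamp(pods)  # selects driver-try1
by_try = pick_by_try_number(pods)                # selects driver-try2
```

   If a scenario like this is possible, selecting purely on the creation 
timestamp would reattach to the stale pod.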
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

