SameerMesiah97 opened a new pull request, #60717:
URL: https://github.com/apache/airflow/pull/60717

   **Description**
   
   This change makes `find_spark_job` deterministic when multiple Spark driver pods match the same task labels.
   
   Previously, the operator raised an exception if more than one matching pod was found. The operator now deterministically selects a single pod for reattachment using a stable ordering strategy:
   
   - Pods are ordered by `creation_timestamp`
   - If timestamps are identical, the pod name is used as a deterministic 
tie-breaker
   - The pod with the latest timestamp (and, if needed, the lexicographically last name) is selected
   
   When duplicate pods are detected, a warning is logged to preserve visibility 
into the unexpected state.
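
   A minimal sketch of the selection logic (the helper name and the shape of the pod objects are illustrative assumptions, not the operator's actual code):

   ```python
   import logging

   log = logging.getLogger(__name__)


   def _select_driver_pod(pods):
       """Deterministically pick one driver pod from a list of label matches."""
       if len(pods) > 1:
           # Duplicate pods are unexpected; warn but do not fail the task.
           log.warning(
               "Found %d Spark driver pods matching the task labels; "
               "reattaching to the most recently created one.",
               len(pods),
           )
       # Sort by creation timestamp, then by pod name as a stable tie-breaker,
       # and take the last element (latest timestamp, lexicographically last name).
       return sorted(
           pods,
           key=lambda p: (p.metadata.creation_timestamp, p.metadata.name),
       )[-1]
   ```

   Sorting on the `(creation_timestamp, name)` tuple keeps the choice stable even when two pods share the same timestamp.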
   
   **Rationale**
   
   Although Spark driver pods are normally unique for a given task execution, 
multiple driver pods with identical labels can occur when the same task attempt 
is started again after an abrupt scheduler or worker interruption.
   
   In these cases, the original Spark driver pod is not cleaned up, and the 
SparkApplication may spawn a new driver pod that reuses the same labels. This 
results in multiple matching pods representing the same logical execution.
   
   Failing fast in this scenario causes unnecessary task failures despite a 
recoverable state. Selecting a deterministic “most recent” pod allows Airflow 
to reattach to the correct driver while still surfacing the unexpected 
condition via warning logs. 
   
   Kubernetes does not guarantee label uniqueness (see ['Labels and Selectors'](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/)). As a result, having multiple pods match the same label selector is a valid and documented Kubernetes state. The operator must therefore handle this case deterministically rather than failing with an exception.
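
   For context, a label-selector lookup with the official Kubernetes Python client can legitimately return several pods; the namespace and label values below are placeholders rather than the ones the operator uses:

   ```python
   from kubernetes import client, config

   config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
   v1 = client.CoreV1Api()

   # Any number of pods may carry these labels; Kubernetes does not enforce uniqueness.
   pods = v1.list_namespaced_pod(
       namespace="spark-jobs",
       label_selector="spark-role=driver,sparkoperator.k8s.io/app-name=example-app",
   ).items
   print([p.metadata.name for p in pods])
   ```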
   
   **Notes**
   
   A docstring has been added to `find_spark_job` describing the function's full behavior, including the handling of duplicate pods.
   
   **Tests**
   
   Added a unit test verifying that `find_spark_job` selects the most recently created Spark driver pod when multiple pods share identical labels.
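
   As a rough illustration of the shape of such a test (the `_select_driver_pod` helper refers to the sketch above, and the pod objects are simple stand-ins rather than real `V1Pod` instances):

   ```python
   from datetime import datetime, timezone
   from types import SimpleNamespace


   def _pod(name, ts):
       # Minimal stand-in exposing only the fields the selection logic reads.
       return SimpleNamespace(metadata=SimpleNamespace(name=name, creation_timestamp=ts))


   def test_reattaches_to_latest_driver_pod():
       older = _pod("driver-a", datetime(2024, 1, 1, tzinfo=timezone.utc))
       newer = _pod("driver-b", datetime(2024, 1, 2, tzinfo=timezone.utc))
       assert _select_driver_pod([newer, older]).metadata.name == "driver-b"
   ```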
   
   **Backwards Compatibility**
   
   Previously, encountering multiple matching pods resulted in an 
`AirflowException`. This change replaces that failure with deterministic pod 
selection and a warning log.
   
   The behavior when exactly one matching pod is found is unchanged. In the common case, `find_spark_job` continues to return the single matching Spark driver pod exactly as before.
   
   This change does not introduce any public API or configuration changes.

