Pranaykarvi opened a new pull request, #63915:
URL: https://github.com/apache/airflow/pull/63915

   ## Problem
   
   When a Kubernetes Job retries (creates a new pod after a pod failure), 
`GKEJobTrigger` keeps tracking the original pod names recorded at trigger 
creation time, so it waits for XCom on the failed pod instead of the new retry 
pod. As a result, the XCom sidecar on the retry pod never receives a 
termination signal. The pod keeps running until the job's 
`activeDeadlineSeconds` is exceeded, and the task fails.
   
   ## Fix
   
   Before XCom extraction, the trigger re-discovers all current pods for the 
job using the `job-name=<job_name>` label selector, keeps only the succeeded 
pods, and extracts XCom from those. If no succeeded pods are found, it falls 
back to the original pod list.
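
   The selection rule can be sketched in isolation. This is a minimal 
illustration, not the code in this PR: `PodInfo` and `select_xcom_pod_names` 
are hypothetical names, and a pod is reduced to its name and phase.

```python
from dataclasses import dataclass


@dataclass
class PodInfo:
    """Hypothetical stand-in for a Kubernetes pod (name plus status.phase)."""

    name: str
    phase: str  # e.g. "Pending", "Running", "Succeeded", "Failed"


def select_xcom_pod_names(current_pods, original_pod_names):
    """Pick the pods to read XCom from.

    Prefer pods that are currently in the Succeeded phase. If there are
    none, fall back to the names recorded at trigger creation time.
    """
    succeeded = [pod.name for pod in current_pods if pod.phase == "Succeeded"]
    return succeeded if succeeded else list(original_pod_names)


# The original pod failed and the Job created a retry pod that succeeded.
pods_after_retry = [
    PodInfo("job-abc-1", "Failed"),
    PodInfo("job-abc-2", "Succeeded"),
]
print(select_xcom_pod_names(pods_after_retry, ["job-abc-1"]))  # ['job-abc-2']
```

   With no succeeded pod in the list, the function returns the original names 
unchanged, which mirrors the fallback described above.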
   
   Added an async `list_pods()` method to `GKEKubernetesAsyncHook` to support 
pod discovery by label selector.
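
   For illustration, pod discovery by label selector might look roughly like 
the sketch below. It is an assumption-laden stand-in, not the hook's actual 
implementation: `FakeCoreV1Api`, `make_pod`, and the standalone `list_pods` 
function are hypothetical, and the fake client only mimics the shape of a 
`list_namespaced_pod(namespace=..., label_selector=...)` call so the example 
runs without a cluster.

```python
import asyncio
from types import SimpleNamespace


class FakeCoreV1Api:
    """In-memory stand-in for an async Kubernetes core API client (assumption)."""

    def __init__(self, pods):
        self._pods = pods

    async def list_namespaced_pod(self, namespace, label_selector):
        # This fake only supports a single "key=value" equality selector.
        key, _, value = label_selector.partition("=")
        items = [
            pod
            for pod in self._pods
            if pod.metadata.namespace == namespace
            and pod.metadata.labels.get(key) == value
        ]
        return SimpleNamespace(items=items)


def make_pod(name, namespace, job_name, phase):
    """Build a minimal pod-like object carrying the job-name label."""
    return SimpleNamespace(
        metadata=SimpleNamespace(
            name=name, namespace=namespace, labels={"job-name": job_name}
        ),
        status=SimpleNamespace(phase=phase),
    )


async def list_pods(client, namespace, label_selector):
    """Hypothetical helper: return the pods matching the label selector."""
    response = await client.list_namespaced_pod(
        namespace=namespace, label_selector=label_selector
    )
    return response.items


async def main():
    client = FakeCoreV1Api(
        [
            make_pod("etl-1", "default", "etl", "Failed"),
            make_pod("etl-2", "default", "etl", "Succeeded"),
            make_pod("other-1", "default", "other", "Succeeded"),
        ]
    )
    pods = await list_pods(client, "default", "job-name=etl")
    return [(pod.metadata.name, pod.status.phase) for pod in pods]


result = asyncio.run(main())
print(result)  # [('etl-1', 'Failed'), ('etl-2', 'Succeeded')]
```

   Both the failed original pod and the succeeded retry pod come back for the 
same `job-name` label, which is why the trigger still has to filter by phase 
afterwards.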
   
   ## Testing
   
   Added the unit test 
`test_run_do_xcom_push_uses_succeeded_retry_pod_not_original_failed_pod`, which 
verifies that the trigger uses the succeeded retry pod for XCom extraction when 
the original pod failed.
   
   Fixes #63838

