holmuk commented on issue #64867: URL: https://github.com/apache/airflow/issues/64867#issuecomment-4223119479
It looks like an architectural issue in `KubernetesJobOperator` / `KubernetesJobTrigger` for `deferrable=True` / `do_xcom_push=True`.

The trigger waits for container completion for every pod name from a precomputed snapshot (`pod_names`) before checking the final Job status. That snapshot is built from pod discovery tied to `parallelism`, not to actual successful completions.

**Example** (`parallelism=2`, `completions=1`):
- Airflow creates a Job
- Kubernetes starts 2 pods
- One pod succeeds
- Job becomes `Complete` (`completions=1` reached)
- The second pod may never reach the expected terminal state
- `KubernetesJobTrigger` keeps waiting on the second pod and does not reach Job-status evaluation, so the task can remain Running/Deferred forever.

**Proposed fix:** Task completion should be driven by Job terminal status (`Complete` / `Failed`), which already reflects `completions`:
- Make Job status the primary completion condition
- Collect XCom/logs only as best-effort from pods that actually finished and are still readable
- Do not block task finalization on missing/invalid/non-terminal pods from the initial snapshot

@jedcunningham @hussein-awala @jscheffl if you accept the proposed fix, I can implement it.
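
To make the proposed ordering concrete, here is a minimal sketch of the decision logic in plain Python. It is not the provider's actual code: `JobCondition`, `PodSnapshot`, `job_terminal_state`, `collect_xcom_best_effort` and `evaluate` are hypothetical names, and the dataclasses are simplified stand-ins for the Kubernetes API objects. It only assumes that a Job reports a `Complete` or `Failed` condition with status `"True"` once it is terminal, and that a finished pod has phase `Succeeded`.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical, simplified stand-ins for Kubernetes API objects.
@dataclass
class JobCondition:
    type: str    # e.g. "Complete" or "Failed"
    status: str  # "True", "False" or "Unknown"


@dataclass
class PodSnapshot:
    name: str
    phase: str                  # "Pending", "Running", "Succeeded", "Failed", "Unknown"
    xcom: Optional[str] = None  # None if no result could be read from the pod


def job_terminal_state(conditions):
    """Return "Complete" or "Failed" if the Job is terminal, else None."""
    for cond in conditions:
        if cond.type in ("Complete", "Failed") and cond.status == "True":
            return cond.type
    return None


def collect_xcom_best_effort(pods):
    """Keep results only from pods that succeeded and were still readable."""
    return {
        pod.name: pod.xcom
        for pod in pods
        if pod.phase == "Succeeded" and pod.xcom is not None
    }


def evaluate(conditions, pods):
    """Job status decides completion; pods never block finalization."""
    state = job_terminal_state(conditions)
    if state is None:
        return {"done": False}  # keep polling the Job, not individual pods
    return {"done": True, "status": state, "xcom": collect_xcom_best_effort(pods)}


# The scenario from the example: parallelism=2, completions=1.
conditions = [JobCondition("Complete", "True")]
pods = [
    PodSnapshot("job-pod-a", "Succeeded", '{"result": 1}'),
    PodSnapshot("job-pod-b", "Running"),  # never reaches a terminal state
]
result = evaluate(conditions, pods)
print(result)
```

With this ordering, the second pod staying in `Running` has no effect on the outcome: the task finishes as soon as the Job is `Complete`, and XCom is taken only from the pod that actually succeeded.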
