wolfdn opened a new pull request, #66705:
URL: https://github.com/apache/airflow/pull/66705
## Problem
When running Kubernetes Pods in deferred mode, the triggerer can return an
`error` event due to transient communication issues with the Kubernetes API
(e.g. timeouts, connection resets) while the pod is still running normally.
In the current implementation, this causes the task to fail immediately.
Additionally, if the trigger emits an `error` event but the base container
has actually completed successfully (exit code 0), the task is unnecessarily
marked as failed.
## Solution
### 1. Re-defer on transient errors when pod is still alive
When `trigger_reentry` receives an `error` event, it now checks the actual
pod state before failing the task. If the pod's base container is still
running or waiting, or the pod is pending, and there are no fatal issues such
as `InvalidImageName`, the task is re-deferred to the triggerer to continue
monitoring.
A `MAX_REDEFER_ATTEMPTS = 3` limit prevents infinite re-defer loops if the
Kubernetes API is persistently unreachable. The re-defer count is tracked via
`trigger_kwargs`, which round-trips through the trigger's emitted events.
Re-deferring is scoped to `"error"` events only. `"timeout"` and `"failed"`
events represent deliberate decisions by the trigger (pod launch timeout,
container failure) and are not retried.
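
The re-defer decision described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `PodSnapshot`, `should_redefer`, `next_trigger_kwargs`, the `"status"` event key, and the `redefer_count` key are all hypothetical names, and whether the limit is inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REDEFER_ATTEMPTS = 3  # limit stated in this PR

# Only InvalidImageName is named in the PR; the real set of fatal reasons may differ.
FATAL_WAITING_REASONS = frozenset({"InvalidImageName"})


@dataclass
class PodSnapshot:
    """Hypothetical, simplified view of pod state (not the Kubernetes client model)."""

    phase: str                              # e.g. "Pending", "Running", "Succeeded"
    container_state: Optional[str] = None   # "running", "waiting", "terminated"
    waiting_reason: Optional[str] = None    # e.g. "InvalidImageName"


def should_redefer(event: dict, pod: PodSnapshot, redefer_count: int) -> bool:
    """Return True if the task should go back to the triggerer."""
    # Only "error" events are retried; "timeout" and "failed" are deliberate.
    if event.get("status") != "error":
        return False
    # Stop once the attempt budget is used up (inclusive bound is an assumption).
    if redefer_count >= MAX_REDEFER_ATTEMPTS:
        return False
    # A fatal waiting reason will not resolve by waiting longer.
    if pod.waiting_reason in FATAL_WAITING_REASONS:
        return False
    return pod.container_state in ("running", "waiting") or pod.phase == "Pending"


def next_trigger_kwargs(trigger_kwargs: dict) -> dict:
    """Return a copy of the kwargs with the hypothetical counter incremented."""
    updated = dict(trigger_kwargs)
    updated["redefer_count"] = updated.get("redefer_count", 0) + 1
    return updated
```

Carrying the counter inside the trigger arguments means no state has to live on the worker between deferrals; the count comes back with the next event.
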
### 2. Treat error as success when container actually succeeded
If the trigger emits an `error` event but the base container has already
terminated with exit code 0, the task is now treated as successful instead of
failing.
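
As a rough illustration of this rule, the outcome for an `error` event could be derived from the base container's final state. The function name and its string inputs are hypothetical simplifications, not the operator's actual API.

```python
from typing import Optional


def resolve_error_event(container_state: Optional[str], exit_code: Optional[int]) -> str:
    """Decide the task outcome after an "error" event (hypothetical helper)."""
    # The container finished cleanly, so the error was only a monitoring problem.
    if container_state == "terminated" and exit_code == 0:
        return "success"
    # Any other state falls through to the existing failure path.
    return "failed"
```
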
### 3. Remove misleading `except TaskDeferred: raise`
The old code had `except TaskDeferred: raise` before the `finally` block,
which appeared to prevent `_clean()` from running during re-deferral. In
reality, Python's `finally` always executes, even after a re-raised
exception, so this guard was ineffective. The re-defer logic is now placed
**before** the `try/finally` block, ensuring `_clean()` (which may delete the
still-running pod) does not execute on re-deferral.
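
The `finally` behaviour is easy to confirm in isolation. The sketch below uses a local stand-in for `TaskDeferred` and a fake `_clean()` that records calls; `old_shape` and `new_shape` are hypothetical names that mirror the control flow only, not the operator code.

```python
class TaskDeferred(BaseException):
    """Local stand-in for Airflow's TaskDeferred, for illustration only."""


def _clean(log: list) -> None:
    # Fake cleanup: record the call instead of deleting a pod.
    log.append("clean")


def old_shape(log: list) -> None:
    """Re-raise inside try: the finally block still runs."""
    try:
        raise TaskDeferred()
    except TaskDeferred:
        raise
    finally:
        _clean(log)


def new_shape(log: list, redefer: bool) -> None:
    """Raise before entering try/finally: cleanup is skipped on re-deferral."""
    if redefer:
        raise TaskDeferred()
    try:
        pass  # normal completion handling would go here
    finally:
        _clean(log)
```
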
---
##### Was generative AI tooling used to co-author this PR?
- [x] Yes (please specify the tool below)
GitHub Copilot - Claude Opus 4.6