1fanwang opened a new pull request, #66767:
URL: https://github.com/apache/airflow/pull/66767
Closes #65400.
## What
When a worker stops heartbeating (OOM kill, K8s eviction, scheduler
restart), `_purge_task_instances_without_heartbeats` cleans the task instance
up by enqueuing a `TaskCallbackRequest`. That request was built without
`task_callback_type`, which `TaskCallbackRequest.is_failure_callback` treats as
a failure — so the DAG processor dispatched `on_failure_callback` even when the
task still had retries remaining. Users got spurious failure alerts for tasks
that subsequently succeeded on retry.
## Why this isn't the canonical one-liner
The canonical pattern at `scheduler_job_runner.py:1427` is
```python
task_callback_type = TaskInstanceState.UP_FOR_RETRY if
ti.is_eligible_to_retry() else TaskInstanceState.FAILED
```
That does not work here. `_purge_task_instances_without_heartbeats` doesn't
load `ti.task`, so `is_eligible_to_retry()` hits this fallback at
`taskinstance.py:1833-1835`:
```python
if not getattr(self, "task", None):
return self.try_number <= self.max_tries
```
For `retries=0` tasks (`try_number=0, max_tries=0`), that returns `True` —
misclassifying them as retry-eligible.
The fix uses `ti.max_tries > 0 and ti.try_number <= ti.max_tries` to match
the "task loaded" branch (`bool(self.task.retries and self.try_number <=
self.max_tries)`). A short comment explains the why so a future cleanup doesn't
replace it with `is_eligible_to_retry()` again.
## How was this tested?
Added parametrized regression
`test_heartbeat_timeout_sets_callback_type_by_retry_eligibility` in
`tests/unit/jobs/test_scheduler_job.py`:
- `retries_remaining` → `TaskInstanceState.UP_FOR_RETRY`
- `no_retries` → `TaskInstanceState.FAILED`
Both fail on `main` (callback_type is `None` in both cases). All 9
neighbouring tests pass (3× `test_find_and_purge_*`,
`test_scheduler_passes_context_*`, `test_external_kill_*`,
`test_heartbeat_timeout_callback_bundle_version_follows_dag_run`).
Previous attempt #65404 by @kimhaggie was closed by @potiuk for inactivity —
picking it up per his "feel free to reopen this PR when you resume work, or
open a new one addressing the issues previously raised."
## Newsfragment
`airflow-core/newsfragments/65400.bugfix.rst`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]