1fanwang opened a new pull request, #66767:
URL: https://github.com/apache/airflow/pull/66767

   Closes #65400.
   
   ## What
   
   When a worker stops heartbeating (OOM kill, K8s eviction, scheduler 
restart), `_purge_task_instances_without_heartbeats` cleans the task instance 
up by enqueuing a `TaskCallbackRequest`. That request was built without 
`task_callback_type`, which `TaskCallbackRequest.is_failure_callback` treats as 
a failure — so the DAG processor dispatched `on_failure_callback` even when the 
task still had retries remaining. Users got spurious failure alerts for tasks 
that subsequently succeeded on retry.
   
   ## Why this isn't the canonical one-liner
   
   The canonical pattern at `scheduler_job_runner.py:1427` is
   ```python
   task_callback_type = TaskInstanceState.UP_FOR_RETRY if 
ti.is_eligible_to_retry() else TaskInstanceState.FAILED
   ```
   That does not work here. `_purge_task_instances_without_heartbeats` doesn't 
load `ti.task`, so `is_eligible_to_retry()` hits this fallback at 
`taskinstance.py:1833-1835`:
   ```python
   if not getattr(self, "task", None):
       return self.try_number <= self.max_tries
   ```
   For `retries=0` tasks (`try_number=0, max_tries=0`), that returns `True` — 
misclassifying them as retry-eligible.
   
   The fix uses `ti.max_tries > 0 and ti.try_number <= ti.max_tries` to match 
the "task loaded" branch (`bool(self.task.retries and self.try_number <= 
self.max_tries)`). A short comment explains the why so a future cleanup doesn't 
replace it with `is_eligible_to_retry()` again.
   
   ## How was this tested?
   
   Added parametrized regression 
`test_heartbeat_timeout_sets_callback_type_by_retry_eligibility` in 
`tests/unit/jobs/test_scheduler_job.py`:
   
   - `retries_remaining` → `TaskInstanceState.UP_FOR_RETRY`
   - `no_retries` → `TaskInstanceState.FAILED`
   
   Both fail on `main` (callback_type is `None` in both cases). All 9 
neighbouring tests pass (3× `test_find_and_purge_*`, 
`test_scheduler_passes_context_*`, `test_external_kill_*`, 
`test_heartbeat_timeout_callback_bundle_version_follows_dag_run`).
   
   Previous attempt #65404 by @kimhaggie was closed by @potiuk for inactivity — 
picking it up per his "feel free to reopen this PR when you resume work, or 
open a new one addressing the issues previously raised."
   
   ## Newsfragment
   
   `airflow-core/newsfragments/65400.bugfix.rst`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to