1fanwang commented on code in PR #66773:
URL: https://github.com/apache/airflow/pull/66773#discussion_r3269906184
##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -2938,7 +2944,10 @@ def _find_task_instances_without_heartbeats(self, *,
session: Session) -> list[T
.join(DM, TI.dag_id == DM.dag_id)
.where(
TI.state.in_((TaskInstanceState.RUNNING,
TaskInstanceState.RESTARTING)),
- TI.last_heartbeat_at < limit_dttm,
+ or_(
+ TI.last_heartbeat_at < limit_dttm,
Review Comment:
Circling back on this, Actually you are right @ephraimbuddy, closing this PR
since this is the incorrect RCA / fix for #58307
Ran the live repro on a freshly-migrated A3 sqlite deployment — `airflow db
migrate`, a real DAG, `airflow standalone`. Triggered, two TIs reached running
through `/run`. Stopped the scheduler, set `last_heartbeat_at = NULL` on both
rows to simulate the post-migration state, restarted with DEBUG logging.
```
19:38:30.413671 | info | Adopting or resetting orphaned tasks for active dag
runs (scheduler_job_runner.py:2815)
19:38:30.445094 | info | Reset the following 2 orphaned TaskInstances
(scheduler_job_runner.py:2894)
```
`adopt_or_reset_orphaned_tasks` rotates both into `task_instance_history`
within 32 ms of startup; fresh rows come up with `last_heartbeat_at` populated
by `/run` in 4 s. The cleanup query only ever sees the fresh rows, so the `OR
last_heartbeat_at IS NULL` predicate never fires.
The repro did surface something different in #58307. vgl-grin is on
`heartbeat_sec=0`, `heartbeat_timeout=300`, containers respawning between runs.
Their `last_heartbeat_at` isn't NULL; `/run` sets it. But the adopt path on
each respawn looks like it can land a fresh `last_heartbeat_at`, which would
restart the timeout clock indefinitely. That fits the asymmetry — `timeout=1`
or `2` terminate, `300/4/5/6/10` don't. Different code path than this PR —
leaving #58307 open.
Thanks for pushing back on this - I'll be more thorough and careful on the
repro and RCA investigation in the future.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]