ashb commented on code in PR #46777:
URL: https://github.com/apache/airflow/pull/46777#discussion_r1956688304
##########
airflow/jobs/scheduler_job_runner.py:
##########
@@ -2009,7 +2010,7 @@ def _find_zombies(self, *, session: Session) -> list[TI]:
             .join(DM, TI.dag_id == DM.dag_id)
             .where(
                 TI.state.in_((TaskInstanceState.RUNNING, TaskInstanceState.RESTARTING)),
-                TI.last_heartbeat_at < limit_dttm,
+                coalesce(TI.last_heartbeat_at, TI.updated_at) < limit_dttm,
Review Comment:
Oh huh. With the switch to storing `last_heartbeat_at` on TaskInstance (in
2.x, zombie detection is based on the LocalTaskJob row in the Job table), I
don't think we ever clear it, which means there's a race condition where, on
resuming from deferral, a task could be instantly picked up as a zombie!
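
To make the race concrete, here is a minimal sketch in plain Python, not Airflow code. `FakeTI`, `looks_like_zombie`, the 5-minute threshold, and the timestamps are all hypothetical stand-ins; the check only mirrors the `coalesce(TI.last_heartbeat_at, TI.updated_at) < limit_dttm` condition from the diff. Clearing the heartbeat on deferral is shown as one possible mitigation, not necessarily what this PR should do.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class FakeTI:
    """Hypothetical stand-in for a TaskInstance row (not the Airflow model)."""

    state: str
    updated_at: datetime
    last_heartbeat_at: Optional[datetime] = None


def looks_like_zombie(ti: FakeTI, limit_dttm: datetime) -> bool:
    """Mirror of the query condition: state filter plus coalesce(...) < limit."""
    if ti.state not in ("running", "restarting"):
        return False
    # coalesce(last_heartbeat_at, updated_at): first non-NULL value wins
    last_seen = ti.last_heartbeat_at if ti.last_heartbeat_at is not None else ti.updated_at
    return last_seen < limit_dttm


now = datetime(2025, 2, 14, 12, 0, tzinfo=timezone.utc)
limit_dttm = now - timedelta(minutes=5)  # assumed zombie threshold

# The task last heartbeated 30 minutes ago, then deferred, and has just
# been set back to "running". The old heartbeat was never cleared.
stale = FakeTI(
    state="running",
    updated_at=now,
    last_heartbeat_at=now - timedelta(minutes=30),
)

# Same task, but with the heartbeat reset to NULL when it deferred, so the
# check falls back to updated_at.
cleared = FakeTI(state="running", updated_at=now, last_heartbeat_at=None)

stale_is_zombie = looks_like_zombie(stale, limit_dttm)
cleared_is_zombie = looks_like_zombie(cleared, limit_dttm)
print(stale_is_zombie, cleared_is_zombie)
```

In this sketch the stale heartbeat wins the coalesce, so the freshly resumed task is flagged until its first new heartbeat lands; with the heartbeat cleared, the fallback to `updated_at` keeps it out of the zombie set.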