1fanwang opened a new issue, #66818:
URL: https://github.com/apache/airflow/issues/66818

   ### Apache Airflow version
   
   main (development)
   
   ### What happened?
   
   `DagRun.update_state()` already detects "task deadlock", the branch taken when tasks remain unfinished but none of them is schedulable, in `airflow-core/src/airflow/models/dagrun.py` (around line 1216):
   
   ```python
   self.log.error("Task deadlock (no runnable tasks); marking run %s failed", self)
   self.set_state(DagRunState.FAILED)
   self.notify_dagrun_state_changed(msg="all_tasks_deadlocked")
   ```
   
   It logs the error and sends the state-change notification, but it does not emit a Stats counter. Anyone who wants to alert on deadlock-induced failures ends up grepping scheduler logs or scraping state-change notifications.
   
   ### What you think should happen instead?
   
   Emit `Stats.incr("dagrun.deadlocked", tags={"dag_id": ..., "run_type": ...})` at the same site, so the existing statsd / OTel pipeline picks it up automatically.
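
A rough sketch of what the tagged counter could look like from the consumer side. It uses a stand-in stats client so it runs without Airflow: `RecordingStats`, `fail_deadlocked_run`, and the sample DAG id are hypothetical, and only the metric name and the two tag keys come from the proposal above.

```python
from collections import Counter


class RecordingStats:
    """Hypothetical stand-in for a stats client; it only records increments."""

    def __init__(self):
        self.counters = Counter()

    def incr(self, stat, count=1, *, tags=None):
        # Key each counter by metric name plus its sorted tag pairs,
        # so the same metric with different tags is counted separately.
        key = (stat, tuple(sorted((tags or {}).items())))
        self.counters[key] += count


Stats = RecordingStats()


def fail_deadlocked_run(dag_id, run_type):
    """Hypothetical helper mirroring the deadlock branch: emit, then fail."""
    Stats.incr("dagrun.deadlocked", tags={"dag_id": dag_id, "run_type": run_type})
    return "failed"


# Two scheduled runs and one manual run of the same DAG hit the deadlock branch.
fail_deadlocked_run("example_dag", "scheduled")
fail_deadlocked_run("example_dag", "scheduled")
fail_deadlocked_run("example_dag", "manual")

for (stat, tags), value in sorted(Stats.counters.items()):
    print(stat, dict(tags), value)
```

Because the tags travel with the counter, a dashboard can split the deadlock rate by DAG and run type without parsing log lines.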
   
   ### Use case / motivation
   
   Track deadlock-induced failure rates as a first-class signal alongside 
`zombies.zombie_unfinished_run_failure_count` and the executor-event failure 
counters. Dashboards / alerts can then chart deadlock rate per DAG and run type 
without log scraping.
   
   ### Proposal
   
   Add a one-line `Stats.incr(...)` call next to the existing log and notify calls. A test mocks `Stats.incr` and asserts that the counter is emitted when the deadlock branch fires.
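
The shape of that test might look roughly like the sketch below. It does not import Airflow: the `Stats` class and `FakeDagRun` here are hypothetical, heavily reduced stand-ins, and the real test would exercise `DagRun.update_state()` with the project's own fixtures.

```python
from unittest import mock


class Stats:
    """Hypothetical stand-in for the stats facade; the real one lives in Airflow."""

    @staticmethod
    def incr(stat, count=1, tags=None):
        pass


class FakeDagRun:
    """Hypothetical, reduced model of a DAG run; not Airflow's DagRun."""

    def __init__(self, dag_id, run_type):
        self.dag_id = dag_id
        self.run_type = run_type
        self.state = "running"

    def update_state(self, unfinished_tasks, runnable_tasks):
        # Deadlock: work remains, but nothing can be scheduled.
        if unfinished_tasks and not runnable_tasks:
            self.state = "failed"
            Stats.incr(
                "dagrun.deadlocked",
                tags={"dag_id": self.dag_id, "run_type": self.run_type},
            )
        return self.state


def test_deadlock_emits_counter():
    run = FakeDagRun("example_dag", "scheduled")
    with mock.patch.object(Stats, "incr") as incr:
        run.update_state(unfinished_tasks=["t1"], runnable_tasks=[])
    incr.assert_called_once_with(
        "dagrun.deadlocked",
        tags={"dag_id": "example_dag", "run_type": "scheduled"},
    )
    assert run.state == "failed"


def test_no_deadlock_emits_nothing():
    run = FakeDagRun("example_dag", "scheduled")
    with mock.patch.object(Stats, "incr") as incr:
        run.update_state(unfinished_tasks=["t1"], runnable_tasks=["t1"])
    incr.assert_not_called()
    assert run.state == "running"
```

Asserting the negative case as well guards against the counter firing on ordinary, non-deadlocked state updates.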
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's Code of Conduct
   

