1fanwang opened a new pull request, #66808:
URL: https://github.com/apache/airflow/pull/66808

   ### Problem
   
   `dag_processing.loop_duration` is aggregate. When the scheduler loop slows 
down, there's no signal pointing at the executor as the cause — 
`executor.heartbeat()` runs every iteration and can spike from Kubernetes API 
throttling, Celery broker lag, etc., but its wall time is folded into the loop 
total.
   
   ### Fix
   
   Wrap `executor.heartbeat()` in 
`stats.timer("scheduler.executor_heartbeat_duration")`, tagged with `executor: 
<ClassName>` so each configured executor reports independently in 
multi-executor deployments.
   
   ### Tests
   
   `test_executor_heartbeat_emits_timer` patches `stats.timer`, runs one 
scheduler loop iteration with the two-executor `mock_executors` fixture, and 
asserts the timer was opened once per executor with the right metric name and 
`tags={"executor": <class>}`.
   
   ### Note
   
   I scoped this to `executor.heartbeat()` only, not 
`_process_executor_events`. The two calls are structurally separate — they live 
in different loops with a fresh `create_session()` block between them — and 
their cost characteristics differ (heartbeat is executor I/O; 
`_process_executor_events` is DB writes plus event-buffer drain). They look 
like two independent signals rather than a tight pair, so a separate timer for 
`_process_executor_events` makes sense as a follow-up if the same need surfaces 
there. Happy to roll it into this PR if reviewers would prefer.
   
   Closes #66803
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to