[I] Emit histograms alongside pool slot gauges to capture distribution between scrapes [airflow]

via GitHub Thu, 04 Jun 2026 23:36:27 -0700


1fanwang opened a new issue, #66800:
URL: https://github.com/apache/airflow/issues/66800


   ### What's the problem
   
   In HA (multi-scheduler) deployments, the pool slot gauges are unreliable.
Each scheduler runs `_emit_pool_metrics` independently and samples the metadata
DB at a slightly different moment, so schedulers emit different counts for the
same metric (e.g. one reports `open_slots=128`, another `126`). The pool metrics
carry only a `pool_name` tag, with no per-scheduler attribute, so on
OpenTelemetry the per-scheduler series collide and the backend keeps whichever
export arrived last. `pool.open_slots` (and the other slot gauges) then flaps
between schedulers' samples, and the reported value can be wrong.
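
The collision described above can be reproduced with a toy model. This is a minimal sketch, not Airflow or OpenTelemetry code: `LastWriteWinsGauge` is a hypothetical stand-in for a backend that keys a gauge series by metric name and tags only, and `default_pool` is just an example pool name.

```python
# Hypothetical model of a backend that stores one gauge value per
# (metric name, tags) series; it is NOT the Airflow or OTel implementation.
class LastWriteWinsGauge:
    def __init__(self):
        self.series = {}

    def set(self, name, tags, value):
        # No per-scheduler attribute in the key, so writers share a series.
        key = (name, tuple(sorted(tags.items())))
        self.series[key] = value

    def read(self, name, tags):
        return self.series[(name, tuple(sorted(tags.items())))]


backend = LastWriteWinsGauge()
tags = {"pool_name": "default_pool"}

backend.set("pool.open_slots", tags, 126)  # sample from scheduler A
backend.set("pool.open_slots", tags, 128)  # sample from scheduler B, a moment later

gauge_value = backend.read("pool.open_slots", tags)
print(gauge_value)  # only the last export survives
```

Because both schedulers write to the same key, the earlier sample is lost regardless of which one was closer to the truth.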
   
   ### Proposal
   
   Emit a histogram alongside each existing pool slot gauge, with the same value
and tag and a `.distribution` suffix on the metric name. A histogram accumulates
every scheduler's sample in the interval rather than overwriting, so operators
can aggregate to the correct value per interval: `min(open_slots)`,
`max(queued/running/...)`. The gauges stay unchanged for backwards
compatibility.
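
As a rough illustration of why the histogram helps, the sketch below accumulates samples per interval instead of overwriting them. `IntervalHistogram` and `emit_pool_slot` are hypothetical names invented for this example, and the exact metric name `pool.open_slots.distribution` is an assumption based on the suffix described above; the actual change is in the linked PR.

```python
from collections import defaultdict


# Hypothetical in-memory histogram: keeps every sample recorded in an interval.
class IntervalHistogram:
    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, name, tags, value):
        self.samples[(name, tuple(sorted(tags.items())))].append(value)

    def summary(self, name, tags):
        values = self.samples[(name, tuple(sorted(tags.items())))]
        return {"count": len(values), "min": min(values), "max": max(values)}


def emit_pool_slot(gauges, histogram, name, tags, value):
    """Emit the unchanged gauge plus a histogram with the same value and tags."""
    gauges[(name, tuple(sorted(tags.items())))] = value  # last write wins
    histogram.record(name + ".distribution", tags, value)  # accumulates


gauges = {}
histogram = IntervalHistogram()
tags = {"pool_name": "default_pool"}

emit_pool_slot(gauges, histogram, "pool.open_slots", tags, 126)  # scheduler A
emit_pool_slot(gauges, histogram, "pool.open_slots", tags, 128)  # scheduler B

summary = histogram.summary("pool.open_slots.distribution", tags)
print(summary)
```

Both samples remain available in the histogram, so an operator can take the minimum for open slots or the maximum for the queued and running counts, while the gauge keeps its existing behaviour.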
   
   Verified end-to-end through the real OTel logger: with two schedulers 
reporting `126` and `128` for the same series, the gauge keeps only the last 
(`128`, wrong) while the histogram preserves both (`min=126`, correct).
   
   Implemented in #66810.
   


