1fanwang opened a new issue, #66800: URL: https://github.com/apache/airflow/issues/66800
### What's the problem

In HA (multi-scheduler) deployments the pool slot gauges are unreliable. Each scheduler runs `_emit_pool_metrics` independently and samples the metadata DB a moment apart, so they emit different counts for the same metric (e.g. one reports `open_slots=128`, another `126`). The pool metrics carry only a `pool_name` tag, with no per-scheduler attribute, so on OpenTelemetry the per-scheduler series collide and the backend keeps whichever export arrived last. `pool.open_slots` (and the other slot gauges) then flap between schedulers' samples and can be wrong.

### Proposal

Emit a histogram alongside each existing pool slot gauge: same value, same tag, with a `.distribution` suffix. A histogram accumulates every scheduler's sample in the interval rather than overwriting, so operators can aggregate to the correct value per interval: `min(open_slots)`, `max(queued/running/...)`. The gauges stay unchanged for backwards compatibility.

Verified end-to-end through the real OTel logger: with two schedulers reporting `126` and `128` for the same series, the gauge keeps only the last (`128`, wrong) while the histogram preserves both (`min=126`, correct).

Implemented in #66810.
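
For illustration only, here is a minimal pure-Python sketch of the collision described in this issue. It does not use Airflow or the OpenTelemetry SDK. `LastValueGauge` and `IntervalHistogram` are hypothetical stand-ins for how a backend might treat the two instrument types, and the pool name `default_pool` is an assumed example value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LastValueGauge:
    """Hypothetical stand-in: a gauge series that keeps only the latest export."""

    value: Optional[int] = None

    def record(self, sample: int) -> None:
        # A later export for the same series overwrites the earlier one.
        self.value = sample


@dataclass
class IntervalHistogram:
    """Hypothetical stand-in: a histogram that accumulates samples per interval."""

    samples: List[int] = field(default_factory=list)

    def record(self, sample: int) -> None:
        # Every sample in the interval is retained for later aggregation.
        self.samples.append(sample)

    def min(self) -> int:
        return min(self.samples)

    def max(self) -> int:
        return max(self.samples)


# Both schedulers emit under the same series key: metric name plus pool_name tag.
series_key = ("pool.open_slots", ("pool_name", "default_pool"))  # assumed pool name

gauge = LastValueGauge()
histogram = IntervalHistogram()

# Two schedulers sample the metadata DB a moment apart and report different counts.
for scheduler_sample in (126, 128):
    gauge.record(scheduler_sample)
    histogram.record(scheduler_sample)

print(series_key[0], "gauge:", gauge.value)                  # last writer wins
print(series_key[0] + ".distribution", "min:", histogram.min())
```

In this sketch the gauge ends at `128` while the histogram still exposes `min=126`, matching the behaviour reported above. A real OTel histogram stores bucket counts plus optional min/max/sum fields rather than raw samples, so the list here is a simplification.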
