On Tue, 27 Jan 2026 at 14:06, Jakub Wartak <[email protected]> wrote:
> > Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
> > NVMe disk you'll get < 10us, after all.
>
> I was blind and concentrated way too much on the bad-behaving I/O rather than
> good I/O - let's call it I/O negativity bias 8)
>
> Now v2 contains the min bucket lowered to 8us (but max then is just ~131ms, I
> didn't want it to use more than 64b total, 16*4b (uint32) = 64b and well
> 16*8b (uint64) = 128b already, so that's why it's capped max at 131072us right
> now).
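For reference, this is how I read that bucketing (a quick sketch only; the 8us
floor, 16 buckets and 131072us cap come from your numbers above, but the exact
boundary handling and the names are my guess, not code from the patch):

#include <stdint.h>

#define IO_HIST_NBUCKETS    16
#define IO_HIST_MIN_US      8       /* 2^3 us, the proposed floor */

/*
 * Map an I/O latency in microseconds to one of 16 power-of-two buckets:
 * bucket 0 takes everything below 8us, each following bucket doubles the
 * upper bound, and the last bucket absorbs everything >= 131072us (~131ms).
 */
static inline int
io_latency_bucket(uint64_t latency_us)
{
    int     bucket = 0;

    while (bucket < IO_HIST_NBUCKETS - 1 &&
           latency_us >= ((uint64_t) IO_HIST_MIN_US << bucket))
        bucket++;

    return bucket;
}

With that layout the top finite boundary is indeed 8us * 2^14 = 131072us, which
is what makes me want a wider range below.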
I have toyed around with similar histogram implementations, as I have dealt
with multiple cases where having a latency histogram would have made diagnosis
much faster. So thank you for working on this.

I think it would be useful to have a max higher than 131ms. I've seen cases
with a buggy multipathing driver and self-DDoS'ing networking hardware where
the problem latencies were in the 20s - 60s range. Being able to attribute the
whole time to I/O allows quickly ruling out other problems. Seeing a count in
the 131ms+ bucket is a strong hint; seeing a count in a 34s-68s bucket is a
smoking gun. Is the main concern for limiting the range cache misses/pollution
when counting I/O, or is it the memory overhead and cost of collecting?

It seems quite wasteful to replicate the histogram 240x, once for each
object/context/op combination. I don't think it matters for I/O
instrumentation overhead - each backend is only doing a limited number of
different I/O categories, and the lower buckets are likely to be on the same
cache line as the counter that gets touched anyway. For higher buckets the
overhead should be negligible compared to the cost of the I/O itself.

What I'm worried about is that this increases PgStat_PendingIO from 5.6KB to
30KB. This whole chunk of memory needs to be scanned and added to the shared
memory structures element by element. Compiler auto-vectorization doesn't seem
to kick in on pgstat_io_flush_cb(), but even so, scanning an extra 25KB of
mostly zeroes on every commit doesn't seem great. Maybe making the histogram
accumulation conditional on the counter field being non-zero is enough to
avoid any issues? A rough sketch of what I mean is at the end of this mail.
I haven't yet constructed a benchmark to see whether it's actually a problem.
A select-only pgbench with small shared buffers and a scale that fits into the
page cache should be an adversarial but still reasonably realistic workload.

I'm not familiar enough with the new stats infrastructure to tell whether it's
a problem, but it seems odd that pgstat_flush_backend_entry_io() isn't
modified to aggregate the histograms.

Regards,
Ants Aasma
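PS. To make the "skip when the counter is zero" idea concrete, something along
these lines is what I had in mind. This is illustrative only: the loop shape
mirrors the existing per-object/context/op iteration in the flush path, but
the struct, the hist[] field and the dimension sizes are placeholders (picked
to give the 240 combinations mentioned above), not names from the patch or the
tree:

#include <stdint.h>

#define NOBJ        3       /* stand-in for IOOBJECT_NUM_TYPES */
#define NCTX        5       /* stand-in for IOCONTEXT_NUM_TYPES */
#define NOP         16      /* stand-in for IOOP_NUM_TYPES */
#define NBUCKETS    16

typedef struct IOStatsSketch
{
    uint64_t    counts[NOBJ][NCTX][NOP];
    uint64_t    hist[NOBJ][NCTX][NOP][NBUCKETS];
} IOStatsSketch;

/*
 * Add pending stats into the shared copy, but skip the 16-slot histogram
 * entirely whenever the plain counter for that combination is zero (the
 * histogram can only be non-zero if the counter is), so the common case only
 * walks the small counter array instead of the whole 30KB.
 */
static void
flush_io_sketch(IOStatsSketch *shared, const IOStatsSketch *pending)
{
    for (int obj = 0; obj < NOBJ; obj++)
        for (int ctx = 0; ctx < NCTX; ctx++)
            for (int op = 0; op < NOP; op++)
            {
                if (pending->counts[obj][ctx][op] == 0)
                    continue;   /* nothing to add, don't touch hist[] */

                shared->counts[obj][ctx][op] += pending->counts[obj][ctx][op];

                for (int b = 0; b < NBUCKETS; b++)
                    shared->hist[obj][ctx][op][b] +=
                        pending->hist[obj][ctx][op][b];
            }
}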
