On Tue, 27 Jan 2026 at 14:06, Jakub Wartak <[email protected]> wrote:
> > Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
> > NVMe disk you'll get < 10us, after all.
>
> I was blind and concentrated way too much on the bad-behaving I/O rather than
> good I/O - let's call it I/O negativity bias 8)
>
> Now v2 contains the min bucket lowered to 8us (but max then is just ~131ms, I
> didn't want it to use more than 64b total, 16*4b (uint32) = 64b and well
> 16*8b (uint64) = 128b already, so that's why it's capped max at 131072us right
> now).
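For reference, this is how I read that bucketing (a quick sketch only; the 8us
floor, 16 buckets and 131072us cap come from your numbers above, but the exact
boundary handling and the names are my guess, not code from the patch):

#include <stdint.h>

#define IO_HIST_NBUCKETS    16
#define IO_HIST_MIN_US      8       /* 2^3 us, the proposed floor */

/*
 * Map an I/O latency in microseconds to one of 16 power-of-two buckets:
 * bucket 0 takes everything below 8us, each following bucket doubles the
 * upper bound, and the last bucket absorbs everything >= 131072us (~131ms).
 */
static inline int
io_latency_bucket(uint64_t latency_us)
{
    int     bucket = 0;

    while (bucket < IO_HIST_NBUCKETS - 1 &&
           latency_us >= ((uint64_t) IO_HIST_MIN_US << bucket))
        bucket++;

    return bucket;
}

With that layout the top finite boundary is indeed 8us * 2^14 = 131072us, which
is what makes me want a wider range below.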
I have toyed around with similar histogram implementations, as I have dealt
with multiple cases where having a latency histogram would have made diagnosis
much faster. So thank you for working on this.

I think it would be useful to have a max higher than 131ms. I've seen cases
with a buggy multipathing driver and self-DDoS'ing networking hardware where
the problem latencies were in the 20s - 60s range. Being able to attribute the
whole time to I/O allows quickly ruling out other problems. Seeing a count in
the 131ms+ bucket is a strong hint; seeing a count in a 34s-68s bucket is a
smoking gun. Is the main concern for limiting the range cache misses/pollution
when counting I/O, or is it the memory overhead and cost of collecting?

It seems quite wasteful to replicate the histogram 240x, once for each
object/context/op combination. I don't think it matters for I/O
instrumentation overhead - each backend is only doing a limited number of
different I/O categories, and the lower buckets are likely to be on the same
cache line as the counter that gets touched anyway. For higher buckets the
overhead should be negligible compared to the cost of the I/O itself.

What I'm worried about is that this increases PgStat_PendingIO from 5.6KB to
30KB. This whole chunk of memory needs to be scanned and added to the shared
memory structures element by element. Compiler auto-vectorization doesn't seem
to kick in on pgstat_io_flush_cb(), but even so, scanning an extra 25KB of
mostly zeroes on every commit doesn't seem great. Maybe making the histogram
accumulation conditional on the counter field being non-zero is enough to
avoid any issues? A rough sketch of what I mean is at the end of this mail.
I haven't yet constructed a benchmark to see whether it's actually a problem.
A select-only pgbench with small shared buffers and a scale that fits into the
page cache should be an adversarial but still reasonably realistic workload.

I'm not familiar enough with the new stats infrastructure to tell whether it's
a problem, but it seems odd that pgstat_flush_backend_entry_io() isn't
modified to aggregate the histograms.

Regards,
Ants Aasma
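PS. To make the "skip when the counter is zero" idea concrete, something along
these lines is what I had in mind. This is illustrative only: the loop shape
mirrors the existing per-object/context/op iteration in the flush path, but
the struct, the hist[] field and the dimension sizes are placeholders (picked
to give the 240 combinations mentioned above), not names from the patch or the
tree:

#include <stdint.h>

#define NOBJ        3       /* stand-in for IOOBJECT_NUM_TYPES */
#define NCTX        5       /* stand-in for IOCONTEXT_NUM_TYPES */
#define NOP         16      /* stand-in for IOOP_NUM_TYPES */
#define NBUCKETS    16

typedef struct IOStatsSketch
{
    uint64_t    counts[NOBJ][NCTX][NOP];
    uint64_t    hist[NOBJ][NCTX][NOP][NBUCKETS];
} IOStatsSketch;

/*
 * Add pending stats into the shared copy, but skip the 16-slot histogram
 * entirely whenever the plain counter for that combination is zero (the
 * histogram can only be non-zero if the counter is), so the common case only
 * walks the small counter array instead of the whole 30KB.
 */
static void
flush_io_sketch(IOStatsSketch *shared, const IOStatsSketch *pending)
{
    for (int obj = 0; obj < NOBJ; obj++)
        for (int ctx = 0; ctx < NCTX; ctx++)
            for (int op = 0; op < NOP; op++)
            {
                if (pending->counts[obj][ctx][op] == 0)
                    continue;   /* nothing to add, don't touch hist[] */

                shared->counts[obj][ctx][op] += pending->counts[obj][ctx][op];

                for (int b = 0; b < NBUCKETS; b++)
                    shared->hist[obj][ctx][op][b] +=
                        pending->hist[obj][ctx][op][b];
            }
}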
