On Thu, 12 Feb 2026 at 10:35, Jakub Wartak <[email protected]>
wrote:

> BTW how have you arrived with the "240x" number? We have 16 buckets for
> each
> of the object/context/type.
>

Sorry, I worded that poorly. I meant that we store a histogram for each
object/context/op combination, 240 of them in total.
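
Just to put a number on it, a quick back-of-the-envelope sketch (the 240
combinations and 16 buckets are the figures from this thread; the 8-byte
counter width is only my assumption, so the total is a ballpark rather than
the exact ~25KB I mentioned):

/*
 * Rough footprint of the pending bucket counters per backend.  The 240
 * combinations and 16 buckets come from this thread; uint64_t per bucket
 * is an assumption.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    size_t  n_combinations = 240;   /* object x context x op histograms */
    size_t  n_buckets = 16;         /* PGSTAT_IO_HIST_BUCKETS */

    printf("%zu bytes of pending bucket counters per backend\n",
           n_combinations * n_buckets * sizeof(uint64_t));
    return 0;
}

All of that gets walked on every flush even when nearly all of it is zero,
which is why the conditional aggregation discussed further down matters.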

>
> > even then scanning an extra 25KB of mostly zeroes on every commit
> > doesn't seem great. Maybe making the histogram accumulation
> > conditional on the counter field being non-zero is enough to avoid any
> > issues? I haven't yet constructed a benchmark to see if it's actually
> > a problem or not. Select only pgbench with small shared buffers and
> > scale that fits into page cache should be an adversarial use case
> > while still being reasonably realistic.
>
> Earlier I've done some benchmarks (please see [1]) based on recommendations
> by Andres to keep low io_combine_limit for that and just tiny
> shared_buffers.
> I'm getting too much noise to derive any results, and as this is related
> to I/O even probably context switches start playing a role there... sadly
> we
> seem not to have a performance farm to answer this.
>

I had glossed over the first benchmark you did. That's pretty close to what
I was talking about: exercise the stats collection path by having ~1 I/O
served from the page cache per trivial transaction, while the prewarm run
should exercise the per-I/O overhead. If neither of those shows any
measurable overhead, then I can't think of a workload where it would be
worse.

> TBH, I'm not sure how to progress with this, I mean we could as you say:
> - reduce PgStat_PendingIO.pending_hist_time_buckets by removing
> IOCONTEXT_NUM_TYPES
>   (not a big loss, just lack of showing BAS strategy)
>

I'm on the fence about this. For the actual problems I've had to diagnose,
it wouldn't have mattered. But the latency difference between bulk and
normal access might be useful for understanding benchmark results better.
Then again, a 5x reduction in size is pretty big.


> - we could even further reduce PgStat_PendingIO.pending_hist_time_buckets
>   by removing IOOBJECT_NUM_TYPES, but those are just 3 and they might be
>   useful
>

WAL write vs. relation write is a very useful distinction for me.

> ... and are You saying to try to do this below thing too?
>
> @@ -288,8 +290,9 @@ pgstat_io_flush_cb(bool nowait)
>                                 for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
> -                                       bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
> -                                               PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
> +                                       if (PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b] > 0)
> +                                               bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
> +                                                       PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
>
I meant this:

@@ -287,6 +287,7 @@ pgstat_io_flush_cb(bool nowait)
                                 bktype_shstats->times[io_object][io_context][io_op] +=
                                         INSTR_TIME_GET_MICROSEC(time);
 
+                                if (PendingIOStats.counts[io_object][io_context][io_op] > 0)
                                         for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
                                                 bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
                                                         PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

Most object/context/op combinations will have a 0 count, so no point in
actually looking at the histogram.
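
For completeness, here is the same idea as a self-contained toy loop (the
Toy* names and dimension macros are made up for illustration; only the
shape mirrors the pgstat_io_flush_cb() loop):

#include <stdint.h>

/* toy dimensions, not the real pgstat enum sizes */
#define N_OBJECT    3
#define N_CONTEXT   5
#define N_OP        16
#define N_BUCKETS   16

typedef struct ToyPendingIO
{
    uint64_t    counts[N_OBJECT][N_CONTEXT][N_OP];
    uint64_t    hist[N_OBJECT][N_CONTEXT][N_OP][N_BUCKETS];
} ToyPendingIO;

static void
toy_flush(const ToyPendingIO *pending, ToyPendingIO *shared)
{
    for (int o = 0; o < N_OBJECT; o++)
        for (int c = 0; c < N_CONTEXT; c++)
            for (int op = 0; op < N_OP; op++)
            {
                shared->counts[o][c][op] += pending->counts[o][c][op];

                /*
                 * Walk the buckets only when this combination saw any I/O
                 * at all; an idle combination's pending histogram is all
                 * zeroes, so scanning it would be pure overhead.
                 */
                if (pending->counts[o][c][op] > 0)
                    for (int b = 0; b < N_BUCKETS; b++)
                        shared->hist[o][c][op][b] +=
                            pending->hist[o][c][op][b];
            }
}

The check should be essentially free, since the pending count is already
being accessed by the flush loop anyway.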


> .. but the main problem, even if I do all of that I won't be able to
> reliably measure the impact, probably the best I could say is
> "runs good as well as master, +/- 3%".
>
> Could you somehow help me with this? I mean should we reduce the
> scope(remove
> context) and add that "if"?
>

I think that if we only aggregate histograms conditionally, then having a
ton of different histograms is less of a problem: only the histograms that
have any data will get accessed, and the overhead is limited to the memory
usage, which I think is acceptable.

I'll run a few benchmarks on what I have available here to see if I can
tease out anything more conclusive than the "no effect, within a 3% error
margin" result we have today.


> > I'm not familiar enough with the new stats infrastructure to tell
> > whether it's a problem, but it seems odd that
> > pgstat_flush_backend_entry_io() isn't modified to aggregate the
> > histograms.
>
> Well I'm first time doing this too, and my understanding is that
> pgstat_io.c::pgstat_io_flush_cb() is flushing the global statistics
> (per backend-type) while the per-individual backend
> pgstat_flush_backend_entry_io() (from pgstat_backend.c) is more about
> per-PID-backends stats (--> for: select * from
> pg_stat_get_backend_io(PID)).
>
> In terms of this patch, the per-backend-PID-I/O histograms are not
> implemented
> yet, and I've raised this question earlier, but I'm starting to believe
> the answer is probably no, we should not implement those (more overhead
> for no apparent benefit, as most of the cases could be tracked down just
> with
> this overall per-backend-type stats ).
>

I agree that per-PID histograms are probably not worth the extra work. But
this left me wondering whether we are allocating the whole set of histograms
too many times: I don't think every place that uses PgStat_BktypeIO actually
needs the histograms. I will need to dig around to understand this code a
bit better.
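
To make that concrete, this is roughly the shape I am imagining, purely as
an illustration (the Toy* names and dimensions are invented, and I have not
yet checked how PgStat_BktypeIO is actually embedded by the per-backend
code):

#include <stdint.h>

/* toy dimensions, not the real pgstat enum sizes */
#define N_OBJECT    3
#define N_CONTEXT   5
#define N_OP        16
#define N_BUCKETS   16

/* the counters and times every PgStat_BktypeIO-style consumer needs */
typedef struct ToyBktypeIO
{
    uint64_t    counts[N_OBJECT][N_CONTEXT][N_OP];
    uint64_t    times[N_OBJECT][N_CONTEXT][N_OP];
} ToyBktypeIO;

/* the bucket arrays, split out into their own struct ... */
typedef struct ToyBktypeIOHist
{
    uint64_t    hist_time_buckets[N_OBJECT][N_CONTEXT][N_OP][N_BUCKETS];
} ToyBktypeIOHist;

/*
 * ... and only embedded where the per-backend-type aggregate lives, so
 * consumers that never expose histograms (e.g. the per-PID stats) do not
 * carry the extra tens of kilobytes.
 */
typedef struct ToyBktypeIOAggregate
{
    ToyBktypeIO     io;
    ToyBktypeIOHist hist;
} ToyBktypeIOAggregate;

Whether that split is workable depends on how much code assumes the
histograms live inside PgStat_BktypeIO itself, which is the part I still
need to read up on.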

Regards,
Ants Aasma
