On Mon, Mar 16, 2026 at 11:50 PM Tomas Vondra <[email protected]> wrote:

Hi Tomas,

Thanks for reviewing! A new version is attached, rebased and addressing some
of the review concerns. I've left the memory-use-reduction ideas unsquashed
to make them easier to reference.

> I've looked at this patch today, as it's tangentially related to the
> patch adding prefetch stats to EXPLAIN. And earlier WIP version of the
> EXPLAIN patch included a couple histograms (not for timings, but for IO
> sizes, prefetch distances, etc.). And during development it was quite
> useful. We decided to keep EXPLAIN simpler, but having that at least in
> a pg_stat_io_histogram view would be nice. So +1 to this.
>
> I went through the thread today. A lot has been discussed, most issues
> seem to have been solved, but what are the outstanding ones? I think it
> would be helpful if someone involved (=Jakub) could write a brief
> summary, listing the various open questions.

I think the feature is fine as it is; however, there are three potentially
unresolved topics:

1. Concerns about memory use. With v7 I had a couple of ideas, and with those
the memory use is minimized while the code stays simple (nothing fancy, just
trimming some data and allocating memory dynamically). I hope those reduce the
memory footprint to acceptable levels; see my earlier description of v7.

2. There were concerns about time conversion overhead. I think the benchmarks
so far show that we do not have any problems there.

3. The remaining open question is the bucket width (AKA how much we want this
to be a tool for spotting outliers vs. how much we want it to be a tool for
I/O performance analysis).

> Now let me add a couple open questions of my own ;-)
>
> My understanding is that the performance (i.e. in terms of CPU) is fine.
> I'm running some tests on my own, both to check the behavior and to
> learn about it. And so far the results show no change in performance,
> it's all within 1% of master. So that's fine.

Right, so far multiple tests have shown that the CPU impact is negligible,
thanks to using that simple multi-dimensional array.

> memory usage
> ------------
> AFAICS the primary outstanding issue seems to be how to represent the
> histograms in each backend, so that it's not wasteful but also not
> overly complex. I'm not sure what's the current situation, and how far
> from acceptable it is.

Correct, the crux of the issue is whether the array used to store the
histograms takes too much memory. We would probably need to hear from
Andres whether it is acceptable or not.

More memory-savvy data structures could probably be used:
- a tile-based allocated array
- a hash structure
However, both of those would potentially mean a bigger CPU impact and more
complexity, and a naive implementation would likely end up allocating memory
dynamically in critical sections (where it would blow up), so that would have
to be addressed by pre-allocating based on - probably - backend type and the
I/O expected there. On the negative side, there's also the concern of pushing
this to PG20 then.

> histogram range
> ---------------
> Another questions is what should be the range of the histogram, and
> whether the buckets should be uint32 or uint64. It's somewhat related to
> the previous question, because smaller histograms need less memory
> (obviously). I think the simpler the better, which means fixed-size
> histograms, with a fixed number of buckets of equal size (e.g. uint64).

+1

> But that implies limited range / precision, so I think we need to decide
> whether we prefer accurate buckets for low or high latencies.
>
> The motivation for adding the histograms was investigating performance
> issues with storage, which involves high latencies. So that would prefer
> better tracking of higher latencies (and accepting lower resolution for
> buckets close to 0). In v7 the first bucket is [0,8) microsecs, which to
> me seems unnecessarily detailed. What if we started with 64us? That'd
> get us to ~1s in the last bucket, and I'd imagine that's enough. We
> could use the last bucket as "latencies above 1s". If you have a lot of
> latencies beyond 1s, you have serious problems.
>
> Yes, you can get 10us with NVMe, and so if everything works OK
> everything will fall into the first bucket. So what? I think it's a
> reasonable trade off. We have to compromise somewhere.

A minor use case I have tried to cover is answering whether an I/O hit the
buffer cache or the device itself, but I'm happy to adjust it again if
there's consensus.

Earlier in this thread, Andres and I had the following exchange:

> > (..) The current implementation uses fast bucket
> > calculation to avoid overheads and tries to cover most useful range of
> > devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
> > abnormally high latency too as from time to time I'm try to help with I/O
> > stuck for *seconds*, usually a sign of some I/O multipath issues, device
> > resetting, or hypervisor woes).

> Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
> NVMe disk you'll get < 10us, after all.

so I've interpreted 8us as the sweet spot, and as v8 stands the max is 128ms
(which to me is already a strong indicator of something being broken; Ants
also wanted to have it much higher):

> I think it would be useful to have a max higher than 131ms. I've seen
> some cases with buggy multipathing driver and self-DDOS'ing networking
> hardware where the problem latencies have been in the 20s - 60s range.
> Being able to attribute the whole time to I/O allows quickly ruling
> out other problems. Seeing a count in 131ms+ bucket is a strong hint,
> seeing a count in 34s-68s bucket is a smoking gun.

but that would again raise the memory consumption even higher.
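Just to make the bucket calculation we keep discussing concrete, it boils down to a floor(log2) via a count-leading-zeros builtin. A minimal sketch (made-up names, assuming an 8us first bucket, power-of-two widths, a saturating last bucket, and GCC/Clang builtins — not the exact patch code):

```c
#include <stdint.h>

#define HIST_NBUCKETS	16		/* assumption: 16 buckets, as in v8 */
#define HIST_MIN_SHIFT	3		/* first bucket covers [0, 8) us; log2(8) = 3 */

/*
 * Map an elapsed time in microseconds to a histogram bucket index.
 * Bucket 0 is [0, 8) us, bucket 1 is [8, 16) us, bucket 2 is [16, 32) us,
 * and the last bucket absorbs everything at or above 8us * 2^(NBUCKETS - 2),
 * i.e. 131ms+ with 16 buckets.
 */
static inline int
hist_bucket_for_usecs(uint64_t usecs)
{
	int			msb;

	if (usecs < (1ULL << HIST_MIN_SHIFT))
		return 0;				/* everything below 8us */

	/* position of the most significant set bit = floor(log2(usecs)) */
	msb = 63 - __builtin_clzll(usecs);

	if (msb - HIST_MIN_SHIFT + 1 >= HIST_NBUCKETS)
		return HIST_NBUCKETS - 1;	/* saturate into the last bucket */

	return msb - HIST_MIN_SHIFT + 1;
}
```

With 16 buckets this gives a [131072us, inf) last bucket, which matches the 131ms+ figure Ants mentions below.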

> Alternatively, we could make the histograms more complex. We could learn
> a thing or two from ddsketch, for example - it can dynamically change
> the range of the histogram, depending on input.
>
> We could also make the buckets variable-sized. The buckets have
> different widths, and assuming uniform distribution will get different
> number of matches - with bucket N getting ~1/2 of bucket (N+1). So it
> could be represented by one fewer bit. But it adds complexity, and IO
> latencies are unlikely to be uniformly distributed.
>
> Alternatively we could use uint32 buckets, as proposed by Andres:
>
> > I guess we could count IO as 4 byte integers, and shift all bucket
> > counts down in the rare case of an on overflow. It's just a 2x
> > improvement, but ...
>
> That'd mean we start sampling the latencies, and only add 50% of them to
> the histogram. And we may need to do that repeatedly, cutting the sample
> rate in 1/2 every time. Which is probably fine for the purpose of this
> view, but it adds complexity, and it means you have to "undo" this when
> displaying the data. Otherwise it'd be impossible to combine or compare
> histograms.
>
> Anyway, what I'm trying to say is that we should keep the histograms
> simple, at least for now.
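For the record, the shift-down-on-overflow idea could look roughly like this (a sketch with made-up names; it deliberately glosses over the fact that after halving, new IOs would have to be sampled at the matching rate, which is exactly the complexity Tomas mentions):

```c
#include <stdint.h>

#define HIST_NBUCKETS 16

typedef struct IOHist
{
	uint32_t	counts[HIST_NBUCKETS];
	uint32_t	scale_shift;	/* each stored count represents 2^scale_shift IOs */
} IOHist;

/*
 * Add one IO to a bucket.  If the bucket would overflow uint32, halve
 * every bucket and remember that by bumping scale_shift; readers must
 * multiply the counts by 2^scale_shift to "undo" this and recover
 * (approximate) raw counts, otherwise histograms can't be compared.
 */
static void
hist_add(IOHist *h, int bucket)
{
	if (h->counts[bucket] == UINT32_MAX)
	{
		for (int i = 0; i < HIST_NBUCKETS; i++)
			h->counts[i] >>= 1;
		h->scale_shift++;
	}
	h->counts[bucket]++;
}
```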

Yes, my main goal was to prioritize keeping it simple, with nearly zero CPU
impact. The secondary tradeoff, in my opinion, is some memory use, but only
for those who enable it via GUCs. Anyway, if I change PGSTAT_IO_HIST_BUCKETS
from 16 to 24 (so from 16*8 = 128 bytes [2 cache lines] to 24*8 = 192 bytes
[3 cache lines]), I get:
- latency resolution of 8us up to 32s (instead of a 128ms max)
- the 'Shared Memory Stats' shm allocation increases from 482944 to 575104
  bytes (with the track* GUCs enabled)
- no impact for backends not using the track*io_timing GUCs
- if the track* GUCs are set, then the uint64_t hist_time_buckets array
  becomes [3][5][8][24], so it grows from ~15kB to ~23kB
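As a sanity check on that arithmetic (a standalone sketch, assuming the [3][5][8][N] shape above, i.e. 3 IO objects x 5 IO contexts x 8 IO ops, one uint64 counter per bucket):

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of one histogram with nbuckets uint64 counters. */
static size_t
hist_bytes(size_t nbuckets)
{
	return nbuckets * sizeof(uint64_t);	/* 16 -> 128 B, 24 -> 192 B */
}

/*
 * Size of the full per-backend array, assuming the [3][5][8][N] shape:
 * 3 IO objects x 5 IO contexts x 8 IO ops = 120 histograms.
 */
static size_t
hist_array_bytes(size_t nbuckets)
{
	return 3 * 5 * 8 * hist_bytes(nbuckets);	/* 16 -> 15360 B, 24 -> 23040 B */
}
```

So 16 buckets gives 15360 bytes (~15kB) and 24 buckets gives 23040 bytes (~23kB), and 192 bytes is exactly three 64-byte cache lines.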

> other histograms
> ----------------
> As mentioned, the EXPLAIN PoC patch had I/O histograms for I/O sizes,
> in-progress I/Os, etc. I wonder if maybe it'd make sense to have
> something like that in pg_stat_io_histogram too. Ofc, that goes against
> the effort to reduce the memory usage etc.

Hm, you got me thinking: maybe we should rename this to
pg_stat_io_lat_histogram, because that way we would leave room (in the
future) for others like pg_stat_io_size_histogram, etc.

> a couple minor review comments
> ------------------------------
>
> 1) There seems to be a bug in calculating the buckets in the SRF:
[..]
> Notice the upper boundary includes the lower boundary of the next
> bucket. It should be [0,8), [8,...). pg_stat_io_histogram_build_tuples
> should probably set "upper.inclusive = false;".

Right, I was blind, fixed.

> 2) This change in monitoring.sgml seems wrong:
>
>    <structname>pg_stat_io_histogram</structname> set of views ...
>
> AFAICS it should still say "pg_stat_io set of views", but maybe it
> should mention the pg_stat_io_histogram too.

Fixed; that was wrongly placed, as this view has nothing to do with the
buffer cache hit ratio.

> 3) pg_leftmost_one_pos64 does this:
>
>   #if SIZEOF_LONG == 8
>     return 63 - __builtin_clzl(word);
>   #elif SIZEOF_LONG_LONG == 8
>     return 63 - __builtin_clzll(word);
>   #else
>
> Shouldn't pg_leading_zero_bits64 do the same thing?

Windows, cough, should be fixed.
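i.e. the 64-bit helper should dispatch on the width of long the same way pg_leftmost_one_pos64 does. Roughly (a sketch assuming GCC/Clang builtins only; the real code also needs the MSVC path, which is where I got bitten):

```c
#include <limits.h>
#include <stdint.h>

/*
 * Count leading zero bits in a 64-bit word (word must be nonzero, as the
 * builtins are undefined for 0).  Dispatch on the width of long so that
 * LLP64 platforms such as Windows (where long is 32-bit) use the
 * long long variant instead of silently truncating.
 */
static inline int
leading_zero_bits64(uint64_t word)
{
#if ULONG_MAX == UINT64_MAX
	return __builtin_clzl(word);
#else
	return __builtin_clzll(word);
#endif
}
```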

> 4) I'm not sure about this pattern:
>
>     PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
>     if(bktype == -1)
>         continue;
>
> Maybe it won't trigger an out-of-bounds read, if the compiler is smart
> enough to delay the access to when the pointer is really needed. But it
> seems confusing / wrong, and I don't think we do this elsewhere. For
> example the functions in pgstat_io.c do this:
>
>     PgStat_BktypeIO *bktype_shstats;
>     if (bktype == -1)
>         continue;
>     bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];

Yes, yes, that was temporary and I was pretty sure I had eradicated all such
code earlier; I did remove it from pgstat_io.c (fun fact: AFAIK it sometimes
failed tests at -O0 with ubsan/asan, but not at -O2). I missed this one and
have now fixed it in src/backend/utils/adt/pgstatfuncs.c too, thanks for
spotting it.

-J.
From b6935f0e2e916a47c1028df31e5b353d63766823 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 08:25:54 +0100
Subject: [PATCH v8 2/6] PendingBackendStats save memory

---
 src/backend/utils/activity/pgstat_backend.c |  4 ++--
 src/include/pgstat.h                        | 16 ++++++++++++----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index f2f8d3ff75f..4cd3fb923c9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -167,7 +167,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 {
 	PgStatShared_Backend *shbackendent;
 	PgStat_BktypeIO *bktype_shstats;
-	PgStat_PendingIO pending_io;
+	PgStat_BackendPendingIO pending_io;
 
 	/*
 	 * This function can be called even if nothing at all has happened for IO
@@ -204,7 +204,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
-	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO));
+	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_BackendPendingIO));
 
 	backend_has_iostats = false;
 }
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f9eece29572..0e689e0a730 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -520,15 +520,23 @@ typedef struct PgStat_Backend
 } PgStat_Backend;
 
 /* ---------
- * PgStat_BackendPending	Non-flushed backend stats.
+ * PgStat_BackendPending(IO)	Non-flushed backend stats.
  * ---------
  */
+typedef struct PgStat_BackendPendingIO {
+	uint64          bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter  counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time      pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+} PgStat_BackendPendingIO;
+
 typedef struct PgStat_BackendPending
 {
 	/*
-	 * Backend statistics store the same amount of IO data as PGSTAT_KIND_IO.
-	 */
-	PgStat_PendingIO pending_io;
+	 * Backend statistics store almost the same amount of IO data as
+	 * PGSTAT_KIND_IO. The only difference between PgStat_BackendPendingIO
+	 * and PgStat_PendingIO is that the latter also tracks IO latency histograms.
+	 */
+	PgStat_BackendPendingIO pending_io;
 } PgStat_BackendPending;
 
 /*
-- 
2.43.0

From 124b04d6dcdb61dd01b0cf7122f161a265363f82 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 13:29:40 +0100
Subject: [PATCH v8 5/6] Condense PgStat_IO.stats[BACKEND_NUM_TYPES] array by
 using PGSTAT_USED_BACKEND_NUM_TYPES to be more memory efficient.

---
 src/backend/utils/activity/pgstat_io.c | 57 +++++++++++++++++++++++---
 src/backend/utils/adt/pgstatfuncs.c    | 28 ++++++++-----
 src/include/miscadmin.h                |  2 +-
 src/include/pgstat.h                   |  5 ++-
 4 files changed, 75 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 8605ea65605..1e9bff4da41 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -225,13 +225,14 @@ pgstat_io_flush_cb(bool nowait)
 {
 	LWLock	   *bktype_lock;
 	PgStat_BktypeIO *bktype_shstats;
+	int			condensedBkType = pgstat_remap_condensed_bktype(MyBackendType);
 
 	if (!have_iostats)
 		return false;
 
 	bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
 	bktype_shstats =
-		&pgStatLocal.shmem->io.stats.stats[MyBackendType];
+		&pgStatLocal.shmem->io.stats.stats[condensedBkType];
 
 	if (!nowait)
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -360,7 +361,11 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		int			bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
 
@@ -386,8 +391,13 @@ pgstat_io_snapshot_cb(void)
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
+		int			bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_shstats;
+		PgStat_BktypeIO *bktype_snap;
+		if (bktype == -1)
+			continue;
+		bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];
+		bktype_snap = &pgStatLocal.snapshot.io->stats[bktype];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
 
@@ -419,7 +429,8 @@ pgstat_io_snapshot_cb(void)
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
 *
-* When adding a new BackendType, also consider adding relevant restrictions to
+* When adding a new BackendType, ensure that pgstat_remap_condensed_bktype()
+* is updated and also consider adding relevant restrictions to
 * pgstat_tracks_io_object() and pgstat_tracks_io_op().
 */
 bool
@@ -457,6 +468,42 @@ pgstat_tracks_io_bktype(BackendType bktype)
 	return false;
 }
 
+
+/*
+ * Remap sparse backend type IDs to contiguous ones. Keep in sync with enum
+ * BackendType and PGSTAT_USED_BACKEND_NUM_TYPES count.
+ *
+ * Returns -1 if the input ID is invalid or unused.
+ */
+int
+pgstat_remap_condensed_bktype(BackendType bktype)
+{
+	static const int mapping_table[BACKEND_NUM_TYPES] = {
+		-1, /* B_INVALID */
+		0,
+		-1, /* B_DEAD_END_BACKEND */
+		1,
+		2,
+		3,
+		4,
+		5,
+		6,
+		-1, /* B_ARCHIVER */
+		7,
+		8,
+		9,
+		10,
+		11,
+		12,
+		13,
+		-1  /* B_LOGGER */
+	};
+
+	if (bktype < 0 || bktype >= BACKEND_NUM_TYPES)
+		return -1;
+	return mapping_table[bktype];
+}
+
 /*
  * Some BackendTypes do not perform IO on certain IOObjects or in certain
  * IOContexts. Some IOObjects are never operated on in some IOContexts. Check
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b457d771474..2f321f5b20c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1581,9 +1581,13 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
-		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+		int			bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_stats;
+		if (bktype == -1)
+			continue;
+		bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
@@ -1591,17 +1595,17 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 		 * expected stats are non-zero, since it keeps the non-Assert code
 		 * cleaner.
 		 */
-		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, i));
 
 		/*
 		 * For those BackendTypes without IO Operation stats, skip
 		 * representing them in the view altogether.
 		 */
-		if (!pgstat_tracks_io_bktype(bktype))
+		if (!pgstat_tracks_io_bktype(i))
 			continue;
 
 		/* save tuples with data from this PgStat_BktypeIO */
-		pg_stat_io_build_tuples(rsinfo, bktype_stats, bktype,
+		pg_stat_io_build_tuples(rsinfo, bktype_stats, i,
 								backends_io_stats->stat_reset_timestamp);
 	}
 
@@ -1760,9 +1764,13 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 
 	backends_io_stats = pgstat_fetch_stat_io();
 
-	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
-		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+		int			bktype = pgstat_remap_condensed_bktype(i);
+		PgStat_BktypeIO *bktype_stats;
+		if (bktype == -1)
+			continue;
+		bktype_stats = &backends_io_stats->stats[bktype];
 
 		/*
 		 * In Assert builds, we can afford an extra loop through all of the
@@ -1770,17 +1778,17 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 		 * expected stats are non-zero, since it keeps the non-Assert code
 		 * cleaner.
 		 */
-		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, i));
 
 		/*
 		 * For those BackendTypes without IO Operation stats, skip
 		 * representing them in the view altogether.
 		 */
-		if (!pgstat_tracks_io_bktype(bktype))
+		if (!pgstat_tracks_io_bktype(i))
 			continue;
 
 		/* save tuples with data from this PgStat_BktypeIO */
-		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, i,
 								backends_io_stats->stat_reset_timestamp);
 	}
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..d0c62d3248e 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -332,7 +332,7 @@ extern void SwitchBackToLocalLatch(void);
  * MyBackendType indicates what kind of a backend this is.
  *
  * If you add entries, please also update the child_process_kinds array in
- * launch_backend.c.
+ * launch_backend.c and PGSTAT_USED_BACKEND_NUM_TYPES in pgstat.h.
  */
 typedef enum BackendType
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 34a0ece0dbb..bca588a9dad 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -358,10 +358,12 @@ typedef struct PgStat_PendingIO
 
 extern PgStat_PendingIO PendingIOStats;
 
+/* This needs to stay in sync with pgstat_tracks_io_bktype() */
+#define PGSTAT_USED_BACKEND_NUM_TYPES (BACKEND_NUM_TYPES - 4)
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
-	PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+	PgStat_BktypeIO stats[PGSTAT_USED_BACKEND_NUM_TYPES];
 } PgStat_IO;
 
 typedef struct PgStat_StatDBEntry
@@ -638,6 +640,7 @@ extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
 extern const char *pgstat_get_io_op_name(IOOp io_op);
 
+extern int pgstat_remap_condensed_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
-- 
2.43.0

From 4eeaff236c9ec6e8d095e2b8505a9112513fa6c5 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 23 Jan 2026 08:10:09 +0100
Subject: [PATCH v8 1/6] Add pg_stat_io_histogram view to provide more detailed
 insight into IO profile

pg_stat_io_histogram displays a histogram of IO latencies for each
backend_type, object, context and io_type. The histogram buckets allow
faster identification of I/O latency outliers caused by faulty hardware
and/or a misbehaving I/O stack. Such I/O outliers, e.g. slow fsyncs, can
sometimes cause intermittent issues, e.g. for COMMIT, or affect the
performance of synchronous standbys.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Ants Aasma <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmwvE4uJLKTgPXeBA4m%2Bd4tTghayoefcaM9%3Dz3_S7i72GA%40mail.gmail.com
---
 configure                              |  38 ++++
 configure.ac                           |   1 +
 doc/src/sgml/config.sgml               |  12 +-
 doc/src/sgml/monitoring.sgml           | 291 +++++++++++++++++++++++++
 doc/src/sgml/wal.sgml                  |   5 +-
 meson.build                            |   1 +
 src/backend/catalog/system_views.sql   |  11 +
 src/backend/utils/activity/pgstat_io.c |  63 ++++++
 src/backend/utils/adt/pgstatfuncs.c    | 145 ++++++++++++
 src/include/catalog/pg_proc.dat        |   9 +
 src/include/pgstat.h                   |  14 ++
 src/include/port/pg_bitutils.h         |  38 +++-
 src/test/regress/expected/rules.out    |   8 +
 src/test/regress/expected/stats.out    |  23 ++
 src/test/regress/sql/stats.sql         |  15 ++
 src/tools/pgindent/typedefs.list       |   1 +
 16 files changed, 668 insertions(+), 7 deletions(-)

diff --git a/configure b/configure
index 5aec0afa9ab..8416093fd5b 100755
--- a/configure
+++ b/configure
@@ -16012,6 +16012,44 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE__BUILTIN_CLZ 1
 _ACEOF
 
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5
+$as_echo_n "checking for __builtin_clzl... " >&6; }
+if ${pgac_cv__builtin_clzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+call__builtin_clzl(unsigned long x)
+{
+    return __builtin_clzl(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_clzl=yes
+else
+  pgac_cv__builtin_clzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5
+$as_echo "$pgac_cv__builtin_clzl" >&6; }
+if test x"${pgac_cv__builtin_clzl}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_CLZL 1
+_ACEOF
+
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5
 $as_echo_n "checking for __builtin_ctz... " >&6; }
diff --git a/configure.ac b/configure.ac
index fead9a6ce99..bdc7bcc2f9a 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1863,6 +1863,7 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap32], [int x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap64], [long int x])
 # We assume that we needn't test all widths of these explicitly:
 PGAC_CHECK_BUILTIN_FUNC([__builtin_clz], [unsigned int x])
+PGAC_CHECK_BUILTIN_FUNC([__builtin_clzl], [unsigned long x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_ctz], [unsigned int x])
 # __builtin_frame_address may draw a diagnostic for non-constant argument,
 # so it needs a different test function.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cdd826fbd3..c06c0874fce 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8840,9 +8840,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>,
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> (if <varname>object</varname>
-        is not <literal>wal</literal>), in the output of the
-        <link linkend="pg-stat-get-backend-io">
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link>
+        (if <varname>object</varname> is not <literal>wal</literal>),
+        in the output of the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function (if
         <varname>object</varname> is not <literal>wal</literal>), in the
         output of <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal>
@@ -8872,7 +8874,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         measure the overhead of timing on your system.
         I/O timing information is displayed in
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> for the
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link> for the
         <varname>object</varname> <literal>wal</literal> and in the output of
         the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function for the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 462019a972c..f0fa759f532 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -509,6 +509,17 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_io_histogram</structname><indexterm><primary>pg_stat_io_histogram</primary></indexterm></entry>
+      <entry>
+       One row for each combination of backend type, context, target object,
+       I/O operation type and latency bucket (in microseconds), containing
+       cluster-wide I/O statistics.
+       See <link linkend="monitoring-pg-stat-io-histogram-view">
+       <structname>pg_stat_io_histogram</structname></link> for details.
+     </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
       <entry>One row per replication slot, showing statistics about the
@@ -715,6 +726,8 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    Users are advised to use the <productname>PostgreSQL</productname>
    statistics views in combination with operating system utilities for a more
    complete picture of their database's I/O performance.
+   Furthermore, the <structname>pg_stat_io_histogram</structname> view can be
+   helpful in identifying latency outliers for specific I/O operations.
   </para>
 
  </sect2>
@@ -3283,6 +3296,284 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 
  </sect2>
 
+ <sect2 id="monitoring-pg-stat-io-histogram-view">
+  <title><structname>pg_stat_io_histogram</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_io_histogram</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_io_histogram</structname> view contains one row of
+   cluster-wide I/O statistics for each combination of backend type, target I/O
+   object, I/O context, I/O operation type, and latency bucket. Combinations
+   that do not make sense are omitted.
+  </para>
+
+  <para>
+   The view shows the I/O latency as perceived by the backend, not the latency
+   at the kernel or device level. This is an important distinction when
+   troubleshooting, as the I/O latency observed by the backend may be affected by:
+   <itemizedlist>
+    <listitem><para>OS scheduler decisions and available CPU resources.</para></listitem>
+    <listitem><para>With AIO, the time spent servicing other I/Os from the
+     queue, which will often inflate the observed latency.</para></listitem>
+    <listitem><para>In case of writes, additional filesystem journaling
+     operations.</para></listitem>
+   </itemizedlist>
+  </para>
+
+  <para>
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
+  </para>
+
+  <table id="pg-stat-io-histogram-view" xreflabel="pg_stat_io_histogram">
+   <title><structname>pg_stat_io_histogram</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        Column Type
+       </para>
+       <para>
+        Description
+       </para>
+      </entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>backend_type</structfield> <type>text</type>
+       </para>
+       <para>
+        Type of backend (e.g. background worker, autovacuum worker). See <link
+        linkend="monitoring-pg-stat-activity-view">
+        <structname>pg_stat_activity</structname></link> for more information
+        on <varname>backend_type</varname>s. Some
+        <varname>backend_type</varname>s do not accumulate I/O operation
+        statistics and will not be included in the view.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>object</structfield> <type>text</type>
+       </para>
+       <para>
+        Target object of an I/O operation. Possible values are:
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>relation</literal>: Permanent relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>temp relation</literal>: Temporary relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
+       </itemizedlist>
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>context</structfield> <type>text</type>
+       </para>
+       <para>
+        The context of an I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>normal</literal>: The default or standard
+          <varname>context</varname> for a type of I/O operation. For
+          example, by default, relation data is read into and written out from
+          shared buffers. Thus, reads and writes of relation data to and from
+          shared buffers are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>vacuum</literal>: I/O operations performed outside of shared
+          buffers while vacuuming and analyzing permanent relations. Temporary
+          table vacuums use the same local buffer pool as other temporary table
+          I/O operations and are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkread</literal>: Certain large read I/O operations
+          done outside of shared buffers, for example, a sequential scan of a
+          large table.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkwrite</literal>: Certain large write I/O operations
+          done outside of shared buffers, such as <command>COPY</command>.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>io_type</structfield> <type>text</type>
+       </para>
+       <para>
+        The type of I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>evict</literal>: eviction of a block from a shared or local buffer.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>fsync</literal>: synchronization of modified data in the
+          kernel's filesystem page cache with the storage device.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>hit</literal>: a buffer cache lookup hit.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>reuse</literal>: reuse of an existing buffer in a
+          limited-size ring buffer (applies to the <literal>bulkread</literal>,
+          <literal>bulkwrite</literal>, or <literal>vacuum</literal> contexts).
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>writeback</literal>: a request to the kernel to flush the
+          described dirty data to disk, preferably asynchronously.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>extend</literal>: addition of new zeroed blocks to the end of a file.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>read</literal>: a read operation on the target object.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>write</literal>: a write operation on the target object.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_latency_us</structfield> <type>int4range</type>
+       </para>
+       <para>
+        The latency range of this bucket (in microseconds); the last bucket is unbounded.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_count</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of I/O operations whose latency fell within the range given
+        by <varname>bucket_latency_us</varname> for this bucket.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+       </para>
+       <para>
+        Time at which these statistics were last reset.
+       </para>
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   Some backend types never perform I/O operations on some I/O objects and/or
+   in some I/O contexts. Rows for combinations that are tracked but never
+   exercised will show zero bucket counts.
+  </para>
+
+  <para>
+   <structname>pg_stat_io_histogram</structname> can be used to identify
+   storage I/O issues.
+   For example:
+   <itemizedlist>
+    <listitem>
+     <para>
+      Abnormally high latency for <literal>fsync</literal> operations might
+      indicate I/O saturation, oversubscription, or hardware connectivity issues.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Unusually high latency for <literal>fsync</literal> operations in a
+      standby's startup backend type might explain long commit durations in
+      synchronous replication setups.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <note>
+   <para>
+    Bucket counts will only advance while
+    <xref linkend="guc-track-io-timing"/> is enabled. The user should be
+    careful when interpreting these counts in case
+    <varname>track_io_timing</varname>
+    was not enabled for the entire time since the last stats reset.
+   </para>
+  </note>
+ </sect2>
+
  <sect2 id="monitoring-pg-stat-bgwriter-view">
   <title><structname>pg_stat_bgwriter</structname></title>
 
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f3b86b26be9..8b8c407e69f 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -832,8 +832,9 @@
    of times <function>XLogWrite</function> writes and
    <function>issue_xlog_fsync</function> syncs WAL data to disk are also
    counted as <varname>writes</varname> and <varname>fsyncs</varname>
-   in <structname>pg_stat_io</structname> for the <varname>object</varname>
-   <literal>wal</literal>, respectively.
+   in <structname>pg_stat_io</structname> and
+   <structname>pg_stat_io_histogram</structname> for the
+   <varname>object</varname> <literal>wal</literal>, respectively.
   </para>
 
   <para>
diff --git a/meson.build b/meson.build
index 46bd6b1468a..73da6040d23 100644
--- a/meson.build
+++ b/meson.build
@@ -2046,6 +2046,7 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
+  'clzl',
   'ctz',
   'constant_p',
   'frame_address',
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 6d6dce18fa3..2a19b6ea5ba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1249,6 +1249,17 @@ SELECT
        b.stats_reset
 FROM pg_stat_get_io() b;
 
+CREATE VIEW pg_stat_io_histogram AS
+SELECT
+       b.backend_type,
+       b.object,
+       b.context,
+       b.io_type,
+       b.bucket_latency_us,
+       b.bucket_count,
+       b.stats_reset
+FROM pg_stat_get_io_histogram() b;
+
 CREATE VIEW pg_stat_wal AS
     SELECT
         w.wal_records,
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..148a2a9c7d5 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -17,6 +17,7 @@
 #include "postgres.h"
 
 #include "executor/instrument.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
@@ -107,6 +108,32 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+#define MIN_PG_STAT_IO_HIST_LATENCY 8191
+static inline int get_bucket_index(uint64 ns) {
+	const uint32 max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+	/*
+	 * Hopefully pre-calculated by the compiler:
+	 * clzl(8191) = clz of thirteen 1-bits on uint64 = 51
+	 */
+	const uint32 min_latency_leading_zeros =
+		pg_leading_zero_bits64(MIN_PG_STAT_IO_HIST_LATENCY);
+
+	/*
+	 * Make sure tmp is at least 8191 (our minimum bucket boundary),
+	 * as __builtin_clzl has undefined behavior when operating on 0.
+	 */
+	uint64 tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY;
+
+	/* count leading zeros */
+	int leading_zeros = pg_leading_zero_bits64(tmp);
+
+	/* normalize the index */
+	uint32 index = min_latency_leading_zeros - leading_zeros;
+
+	/* clamp it to the maximum */
+	return (index > max_index) ? max_index : index;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  *
@@ -125,6 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 	if (!INSTR_TIME_IS_ZERO(start_time))
 	{
 		instr_time	io_time;
+		int bucket_index;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
@@ -152,6 +180,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 
+		/* calculate the bucket_index based on latency in nanoseconds (uint64) */
+		bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+
 		/* Add the per-backend count */
 		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
 										io_time);
@@ -221,6 +253,10 @@ pgstat_io_flush_cb(bool nowait)
 
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
+
+				for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+					bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+						PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
 			}
 		}
 	}
@@ -274,6 +310,33 @@ pgstat_get_io_object_name(IOObject io_object)
 	pg_unreachable();
 }
 
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			return "evict";
+		case IOOP_FSYNC:
+			return "fsync";
+		case IOOP_HIT:
+			return "hit";
+		case IOOP_REUSE:
+			return "reuse";
+		case IOOP_WRITEBACK:
+			return "writeback";
+		case IOOP_EXTEND:
+			return "extend";
+		case IOOP_READ:
+			return "read";
+		case IOOP_WRITE:
+			return "write";
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5f907335990..b457d771474 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -18,6 +18,7 @@
 #include "access/xlog.h"
 #include "access/xlogprefetcher.h"
 #include "catalog/catalog.h"
+#include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -30,6 +31,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/rangetypes.h"
 #include "utils/timestamp.h"
 #include "utils/tuplestore.h"
 #include "utils/wait_event.h"
@@ -1642,6 +1644,149 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
+/*
+ * When adding a new column to the pg_stat_io_histogram view and the
+ * pg_stat_get_io_histogram() function, add a new enum value here above
+ * HIST_IO_NUM_COLUMNS.
+ */
+typedef enum hist_io_stat_col
+{
+	HIST_IO_COL_INVALID = -1,
+	HIST_IO_COL_BACKEND_TYPE,
+	HIST_IO_COL_OBJECT,
+	HIST_IO_COL_CONTEXT,
+	HIST_IO_COL_IOTYPE,
+	HIST_IO_COL_BUCKET_US,
+	HIST_IO_COL_COUNT,
+	HIST_IO_COL_RESET_TIME,
+	HIST_IO_NUM_COLUMNS
+} histogram_io_stat_col;
+
+/*
+ * pg_stat_io_histogram_build_tuples
+ *
+ * Helper routine for pg_stat_get_io_histogram(), filling a result tuplestore
+ * with one tuple per latency bucket for each object, context, and I/O type
+ * supported by the caller, based on the contents of bktype_stats.
+ */
+static void
+pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+						PgStat_BktypeIO *bktype_stats,
+						BackendType bktype,
+						TimestampTz stat_reset_timestamp)
+{
+	/* Get OID for int4range type */
+	Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+	Oid			range_typid = TypenameGetTypid("int4range");
+	TypeCacheEntry *typcache = lookup_type_cache(range_typid, TYPECACHE_RANGE_INFO);
+
+	for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+	{
+		const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			const char *context_name = pgstat_get_io_context_name(io_context);
+
+			/*
+			 * Some combinations of BackendType, IOObject, and IOContext are
+			 * not valid for any type of IOOp. In such cases, omit the entire
+			 * row from the view.
+			 */
+			if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+				continue;
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				const char *op_name = pgstat_get_io_op_name(io_op);
+
+				for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++) {
+					Datum		values[HIST_IO_NUM_COLUMNS] = {0};
+					bool		nulls[HIST_IO_NUM_COLUMNS] = {0};
+					RangeBound	lower, upper;
+					RangeType	*range;
+
+					values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
+					values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
+					values[HIST_IO_COL_CONTEXT] = CStringGetTextDatum(context_name);
+					values[HIST_IO_COL_IOTYPE] = CStringGetTextDatum(op_name);
+
+					/* bucket's latency range in microseconds */
+					if (bucket == 0)
+						lower.val = Int32GetDatum(0);
+					else
+						lower.val = Int32GetDatum(1 << (2 + bucket));
+					lower.infinite = false;
+					lower.inclusive = true;
+					lower.lower = true;
+					if (bucket == PGSTAT_IO_HIST_BUCKETS - 1)
+						upper.infinite = true;
+					else
+					{
+						upper.val = Int32GetDatum(1 << (2 + bucket + 1));
+						upper.infinite = false;
+					}
+					upper.inclusive = false;
+					upper.lower = false;
+
+					range = make_range(typcache, &lower, &upper, false, NULL);
+					values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
+
+					/* bucket count */
+					values[HIST_IO_COL_COUNT] = Int64GetDatum(
+						bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+
+					if (stat_reset_timestamp != 0)
+						values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
+					else
+						nulls[HIST_IO_COL_RESET_TIME] = true;
+
+					tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+								 values, nulls);
+				}
+			}
+		}
+	}
+}
+
+Datum
+pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_IO  *backends_io_stats;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_stat_io();
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+		/*
+		 * In Assert builds, we can afford an extra loop through all of the
+		 * counters (in pgstat_bktype_io_stats_valid()), checking that only
+		 * expected stats are non-zero, since it keeps the non-Assert code
+		 * cleaner.
+		 */
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		if (!pgstat_tracks_io_bktype(bktype))
+			continue;
+
+		/* save tuples with data from this PgStat_BktypeIO */
+		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+								backends_io_stats->stat_reset_timestamp);
+	}
+
+	return (Datum) 0;
+}
+
 /*
  * pg_stat_wal_build_tuple
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fc8d82665b8..5a13ca9b8a3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6041,6 +6041,15 @@
   proargnames => '{backend_type,object,context,reads,read_bytes,read_time,writes,write_bytes,write_time,writebacks,writeback_time,extends,extend_bytes,extend_time,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
+{ oid => '6149', descr => 'statistics: per backend type IO latency histogram',
+  proname => 'pg_stat_get_io_histogram', prorows => '30', proretset => 't',
+  provolatile => 'v', proparallel => 'r', prorettype => 'record',
+  proargtypes => '',
+  proallargtypes => '{text,text,text,text,int4range,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,io_type,bucket_latency_us,bucket_count,stats_reset}',
+  prosrc => 'pg_stat_get_io_histogram' },
+
 { oid => '6386', descr => 'statistics: backend IO statistics',
   proname => 'pg_stat_get_backend_io', prorows => '5', proretset => 't',
   provolatile => 'v', proparallel => 'r', prorettype => 'record',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 216b93492ba..f9eece29572 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -325,11 +325,23 @@ typedef enum IOOp
 	(((unsigned int) (io_op)) < IOOP_NUM_TYPES && \
 	 ((unsigned int) (io_op)) >= IOOP_EXTEND)
 
+/*
+ * This should strike a balance between being fast and providing value
+ * to users:
+ * 1. We want to cover various fast and slow device types (0.01ms - 15ms)
+ * 2. We want to also cover sporadic long-tail latencies (hardware issues,
+ *    delayed fsyncs, stuck I/O)
+ * 3. We want to keep the per-backend-type footprint small:
+ *    16 * sizeof(uint64) = 128 bytes, i.e. two cachelines.
+ */
+#define PGSTAT_IO_HIST_BUCKETS 16
+
 typedef struct PgStat_BktypeIO
 {
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_PendingIO
@@ -337,6 +349,7 @@ typedef struct PgStat_PendingIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
 typedef struct PgStat_IO
@@ -609,6 +622,7 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
 
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 0bca559caaa..f00780ed312 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,42 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
 
+
+/*
+ * pg_leading_zero_bits64
+ *		Returns the number of leading 0-bits in "word", starting at the most
+ *		significant bit position; "word" must not be 0 (undefined behavior).
+ */
+static inline int
+pg_leading_zero_bits64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZL
+	Assert(word != 0);
+
+#if SIZEOF_LONG == 8
+	return __builtin_clzl(word);
+#elif SIZEOF_LONG_LONG == 8
+	return __builtin_clzll(word);
+#else
+#error "cannot find integer type of the same size as uint64_t"
+#endif
+
+#else
+	uint64 y;
+	int n = 64;
+	if (word == 0)
+		return 64;
+
+	y = word >> 32; if (y != 0) { n -= 32; word = y; }
+	y = word >> 16; if (y != 0) { n -= 16; word = y; }
+	y = word >> 8;  if (y != 0) { n -= 8;  word = y; }
+	y = word >> 4;  if (y != 0) { n -= 4;  word = y; }
+	y = word >> 2;  if (y != 0) { n -= 2;  word = y; }
+	y = word >> 1;  if (y != 0) { return n - 2; }
+	return n - 1;
+#endif
+}
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
@@ -71,7 +107,7 @@ pg_leftmost_one_pos32(uint32 word)
 static inline int
 pg_leftmost_one_pos64(uint64 word)
 {
-#ifdef HAVE__BUILTIN_CLZ
+#ifdef HAVE__BUILTIN_CLZL
 	Assert(word != 0);
 
 #if SIZEOF_LONG == 8
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 9ed0a1756c0..5b5a8d6defb 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1952,6 +1952,14 @@ pg_stat_io| SELECT backend_type,
     fsync_time,
     stats_reset
    FROM pg_stat_get_io() b(backend_type, object, context, reads, read_bytes, read_time, writes, write_bytes, write_time, writebacks, writeback_time, extends, extend_bytes, extend_time, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+pg_stat_io_histogram| SELECT backend_type,
+    object,
+    context,
+    io_type,
+    bucket_latency_us,
+    bucket_count,
+    stats_reset
+   FROM pg_stat_get_io_histogram() b(backend_type, object, context, io_type, bucket_latency_us, bucket_count, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index b99462bf946..1dec1348ab1 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1813,6 +1813,29 @@ SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
  t
 (1 row)
 
+-- Check that pg_stat_io_histograms sees some growing counts in buckets
+-- We could also try with checkpointer, but it often runs with fsync=off
+-- during test.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+ ?column? 
+----------
+ t
+(1 row)
+
+RESET track_io_timing;
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
  pg_stat_get_backend_io 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 941222cf0be..b6405fb2e8d 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -861,6 +861,21 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
   FROM pg_stat_get_backend_io(pg_backend_pid()) \gset
 SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
 
+
+-- Check that pg_stat_io_histograms sees some growing counts in buckets
+-- We could also try with checkpointer, but it often runs with fsync=off
+-- during test.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+RESET track_io_timing;
+
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
 SELECT pg_stat_get_backend_io(0);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 52f8603a7be..f441fd32661 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3778,6 +3778,7 @@ gtrgm_consistent_cache
 gzFile
 heap_page_items_state
 help_handler
+histogram_io_stat_col
 hlCheck
 hstoreCheckKeyLen_t
 hstoreCheckValLen_t
-- 
2.43.0

From d70e793bc38055a7fdfd74e8110afa29f74a8d83 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 12:09:10 +0100
Subject: [PATCH v8 4/6] Convert PgStat_IO to pointer to avoid huge static
 memory allocation if not used.

---
 src/backend/utils/activity/pgstat.c    |  9 ++++++++-
 src/backend/utils/activity/pgstat_io.c | 14 +++++++++++---
 src/include/utils/pgstat_internal.h    |  2 +-
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index f015f217766..d61c50a4aef 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1644,10 +1644,17 @@ pgstat_write_statsfile(void)
 
 		pgstat_build_snapshot_fixed(kind);
 		if (pgstat_is_kind_builtin(kind))
-			ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		{
+			if (kind == PGSTAT_KIND_IO)
+				ptr = (char *) pgStatLocal.snapshot.io;
+			else
+				ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		}
 		else
 			ptr = pgStatLocal.snapshot.custom_data[kind - PGSTAT_KIND_CUSTOM_MIN];
 
+		Assert(ptr != NULL);
+
 		fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
 		pgstat_write_chunk_s(fpout, &kind);
 		pgstat_write_chunk(fpout, ptr, info->shared_data_len);
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index ae689d3926e..8605ea65605 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -19,6 +19,7 @@
 #include "executor/instrument.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
 PgStat_PendingIO PendingIOStats;
@@ -199,7 +200,7 @@ pgstat_fetch_stat_io(void)
 {
 	pgstat_snapshot_fixed(PGSTAT_KIND_IO);
 
-	return &pgStatLocal.snapshot.io;
+	return pgStatLocal.snapshot.io;
 }
 
 /*
@@ -348,6 +349,9 @@ pgstat_io_init_shmem_cb(void *stats)
 
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 		LWLockInitialize(&stat_shmem->locks[i], LWTRANCHE_PGSTATS_DATA);
+
+	/* this might end up being lazily allocated in pgstat_io_snapshot_cb() */
+	pgStatLocal.snapshot.io = NULL;
 }
 
 void
@@ -375,11 +379,15 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 void
 pgstat_io_snapshot_cb(void)
 {
+	if (unlikely(pgStatLocal.snapshot.io == NULL))
+		pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext,
+				sizeof(PgStat_IO));
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
 		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
 
@@ -388,7 +396,7 @@ pgstat_io_snapshot_cb(void)
 		 * the reset timestamp as well.
 		 */
 		if (i == 0)
-			pgStatLocal.snapshot.io.stat_reset_timestamp =
+			pgStatLocal.snapshot.io->stat_reset_timestamp =
 				pgStatLocal.shmem->io.stats.stat_reset_timestamp;
 
 		/* using struct assignment due to better type safety */
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9b8fbae00ed..407657e060c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -600,7 +600,7 @@ typedef struct PgStat_Snapshot
 
 	PgStat_CheckpointerStats checkpointer;
 
-	PgStat_IO	io;
+	PgStat_IO	*io;
 
 	PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
 
-- 
2.43.0

From 4a1f2e1caaf7af975fdddc7d84f427f5b1cac5e4 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 11:26:19 +0100
Subject: [PATCH v8 3/6] PendingIOStats save memory

---
 src/backend/utils/activity/pgstat.c      | 10 ++++++++
 src/backend/utils/activity/pgstat_io.c   | 20 +++++++++-------
 src/include/pgstat.h                     |  8 ++++++-
 src/test/recovery/t/029_stats_restart.pl | 29 ++++++++++++++++++++++++
 src/test/regress/expected/stats.out      | 23 -------------------
 src/test/regress/sql/stats.sql           | 15 ------------
 6 files changed, 58 insertions(+), 47 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 11bb71cad5a..f015f217766 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -104,8 +104,10 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -671,6 +673,14 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
+	/* Allocate I/O latency buckets only if we are going to populate them */
+	if (track_io_timing || track_wal_io_timing)
+		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext,
+																		  IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES *
+																		  PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
+	else
+		PendingIOStats.pending_hist_time_buckets = NULL;
+
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 148a2a9c7d5..ae689d3926e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -21,7 +21,7 @@
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
 
-static PgStat_PendingIO PendingIOStats;
+PgStat_PendingIO PendingIOStats;
 static bool have_iostats = false;
 
 /*
@@ -180,9 +180,11 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 
-		/* calculate the bucket_index based on latency in nanoseconds (uint64) */
-		bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
-		PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+		if (PendingIOStats.pending_hist_time_buckets != NULL) {
+			/* calculate the bucket_index based on latency in nanoseconds (uint64) */
+			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+			PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+		}
 
 		/* Add the per-backend count */
 		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
@@ -254,9 +256,10 @@ pgstat_io_flush_cb(bool nowait)
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
 
-				for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-					bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-						PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
+				if (PendingIOStats.pending_hist_time_buckets != NULL)
+					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+						bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+							PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
 			}
 		}
 	}
@@ -265,7 +268,12 @@ pgstat_io_flush_cb(bool nowait)
 
 	LWLockRelease(bktype_lock);
 
-	memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+	/* Reset pending counters, preserving the latency buckets array pointer */
+	memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets));
+	/* The pointed-to bucket array must be cleared too, to avoid re-flushing */
+	if (PendingIOStats.pending_hist_time_buckets != NULL)
+		memset(PendingIOStats.pending_hist_time_buckets, 0,
+			   IOOBJECT_NUM_TYPES * sizeof(*PendingIOStats.pending_hist_time_buckets));
 
 	have_iostats = false;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0e689e0a730..34a0ece0dbb 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -349,9 +349,15 @@ typedef struct PgStat_PendingIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-	uint64		pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
+	/*
+	 * Dynamically allocated [IOOBJECT][IOCONTEXT][IOOP][BUCKETS] array;
+	 * allocated only when track_io_timing/track_wal_io_timing is enabled.
+	 */
+	uint64		(*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
+extern PgStat_PendingIO PendingIOStats;
+
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index cdc427dbc78..33939c8701a 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -293,7 +293,36 @@ cmp_ok(
 	$wal_restart_immediate->{reset},
 	"$sect: reset timestamp is new");
 
+
+## Test pg_stat_io_histogram, which (due to dynamic memory allocation) only
+## becomes active for new backends when track_[io|wal_io]_timing is set globally
+$sect = "pg_stat_io_histogram";
+$node->append_conf('postgresql.conf', "track_io_timing = 'on'");
+$node->append_conf('postgresql.conf', "track_wal_io_timing = 'on'");
+$node->restart;
+
+
+## Check that pg_stat_io_histogram sees growing counts in the buckets.
+## We could also try the checkpointer, but it often runs with fsync=off
+## during tests.
+my $countbefore = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+$node->safe_psql('postgres', "CREATE TABLE test_io_hist(id bigint);");
+$node->safe_psql('postgres', "INSERT INTO test_io_hist SELECT generate_series(1, 100) s;");
+$node->safe_psql('postgres', "SELECT pg_stat_force_next_flush();");
+
+my $countafter = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+cmp_ok(
+	$countafter, '>', $countbefore,
+	"pg_stat_io_histogram: latency buckets growing");
+
 $node->stop;
+
 done_testing();
 
 sub trigger_funcrel_stat
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 1dec1348ab1..b99462bf946 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1813,29 +1813,6 @@ SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
  t
 (1 row)
 
--- Check that pg_stat_io_histograms sees some growing counts in buckets
--- We could also try with checkpointer, but it often runs with fsync=off
--- during test.
-SET track_io_timing TO 'on';
-SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-CREATE TABLE test_io_hist(id bigint);
-INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
-SELECT pg_stat_force_next_flush();
- pg_stat_force_next_flush 
---------------------------
- 
-(1 row)
-
-SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
- ?column? 
-----------
- t
-(1 row)
-
-RESET track_io_timing;
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
  pg_stat_get_backend_io 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index b6405fb2e8d..941222cf0be 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -861,21 +861,6 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
   FROM pg_stat_get_backend_io(pg_backend_pid()) \gset
 SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
 
-
--- Check that pg_stat_io_histograms sees some growing counts in buckets
--- We could also try with checkpointer, but it often runs with fsync=off
--- during test.
-SET track_io_timing TO 'on';
-SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-CREATE TABLE test_io_hist(id bigint);
-INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
-SELECT pg_stat_force_next_flush();
-SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
-WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
-SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
-RESET track_io_timing;
-
 -- Check invalid input for pg_stat_get_backend_io()
 SELECT pg_stat_get_backend_io(NULL);
 SELECT pg_stat_get_backend_io(0);
-- 
2.43.0

From 03639d22f4cbfc1a759b9ea98bba5c78f17ae2ea Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 6 Mar 2026 14:00:38 +0100
Subject: [PATCH v8 6/6] Further condense and reduce memory used by the
 pgstat_io(_histogram) subsystem by eliminating tracking of useless backend
 types: autovacuum launcher and standalone backend.

---
 src/backend/utils/activity/pgstat_io.c   | 17 +++++++++++------
 src/include/pgstat.h                     |  2 +-
 src/test/recovery/t/029_stats_restart.pl |  5 -----
 src/test/regress/expected/stats.out      | 14 +-------------
 4 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 1e9bff4da41..6c11430ad94 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -73,6 +73,8 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 	Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES);
 	Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES);
 	Assert(pgstat_is_ioop_tracked_in_bytes(io_op) || bytes == 0);
+	if (unlikely(MyBackendType == B_STANDALONE_BACKEND || MyBackendType == B_AUTOVAC_LAUNCHER))
+		return;
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
 	PendingIOStats.counts[io_object][io_context][io_op] += cnt;
@@ -425,6 +427,9 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
+* - Autovacuum launcher because it performs hardly any IO
+* - Standalone backend as it is only used in unusual maintenance
+*   scenarios
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -446,9 +451,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
+		case B_AUTOVAC_LAUNCHER:
+		case B_STANDALONE_BACKEND:
 			return false;
 
-		case B_AUTOVAC_LAUNCHER:
 		case B_AUTOVAC_WORKER:
 		case B_BACKEND:
 		case B_BG_WORKER:
@@ -456,7 +462,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_CHECKPOINTER:
 		case B_IO_WORKER:
 		case B_SLOTSYNC_WORKER:
-		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
 		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
@@ -482,20 +487,20 @@ pgstat_remap_condensed_bktype(BackendType bktype) {
 		-1, /* B_INVALID */
 		0,
 		-1, /* B_DEAD_END_BACKEND */
+		-1, /* B_AUTOVAC_LAUNCHER */
 		1,
 		2,
 		3,
 		4,
+		-1, /* B_STANDALONE_BACKEND */
+		-1, /* B_ARCHIVER */
 		5,
 		6,
-		-1, /* B_ARCHIVER */
 		7,
 		8,
-		8,
+		9,
 		10,
 		11,
-		12,
-		13,
 		-1  /* B_LOGGER */
 	};
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bca588a9dad..554ae87278c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -359,7 +359,7 @@ typedef struct PgStat_PendingIO
 extern PgStat_PendingIO PendingIOStats;
 
 /* This needs to stay in sync with pgstat_tracks_io_bktype() */
-#define PGSTAT_USED_BACKEND_NUM_TYPES BACKEND_NUM_TYPES - 4
+#define PGSTAT_USED_BACKEND_NUM_TYPES (BACKEND_NUM_TYPES - 6)
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 33939c8701a..681fb9ac16d 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -22,12 +22,7 @@ my $sect = "startup";
 
 # Check some WAL statistics after a fresh startup.  The startup process
 # should have done WAL reads, and initialization some WAL writes.
-my $standalone_io_stats = io_stats('init', 'wal', 'standalone backend');
 my $startup_io_stats = io_stats('normal', 'wal', 'startup');
-cmp_ok(
-	'0', '<',
-	$standalone_io_stats->{writes},
-	"$sect: increased standalone backend IO writes");
 cmp_ok(
 	'0', '<',
 	$startup_io_stats->{reads},
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index b99462bf946..6f99c10fdb3 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -16,11 +16,6 @@ SHOW track_counts;  -- must be on
 SELECT backend_type, object, context FROM pg_stat_io
   ORDER BY backend_type COLLATE "C", object COLLATE "C", context COLLATE "C";
 backend_type|object|context
-autovacuum launcher|relation|bulkread
-autovacuum launcher|relation|init
-autovacuum launcher|relation|normal
-autovacuum launcher|wal|init
-autovacuum launcher|wal|normal
 autovacuum worker|relation|bulkread
 autovacuum worker|relation|init
 autovacuum worker|relation|normal
@@ -67,13 +62,6 @@ slotsync worker|relation|vacuum
 slotsync worker|temp relation|normal
 slotsync worker|wal|init
 slotsync worker|wal|normal
-standalone backend|relation|bulkread
-standalone backend|relation|bulkwrite
-standalone backend|relation|init
-standalone backend|relation|normal
-standalone backend|relation|vacuum
-standalone backend|wal|init
-standalone backend|wal|normal
 startup|relation|bulkread
 startup|relation|bulkwrite
 startup|relation|init
@@ -95,7 +83,7 @@ walsummarizer|wal|init
 walsummarizer|wal|normal
 walwriter|wal|init
 walwriter|wal|normal
-(79 rows)
+(67 rows)
 \a
 -- ensure that both seqscan and indexscan plans are allowed
 SET enable_seqscan TO on;
-- 
2.43.0
