The attached patch set is different enough from previous versions that
I've kept it as a new patch set.
Note that local buffer allocations are now correctly tracked.

On Tue, Jul 12, 2022 at 1:01 PM Andres Freund <and...@anarazel.de> wrote:

> Hi,
>
> On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > > I'm not sure how we would fix this.
> > >
> > > I think there scarcely happens flush for strategy-loaded buffers.  If
> > > that is sensible, IOOP_FSYNC would not make much sense for
> > > IOPATH_STRATEGY.
> > >
> >
> > Why would it be less likely for a backend to do its own fsync when
> > flushing a dirty strategy buffer than a regular dirty shared buffer?
>
> We really just don't expect a backend to do many segment fsyncs at
> all. Otherwise there's something wrong with the forwarding mechanism.
>

When a dirty strategy buffer is written out, if the pendingOps sync
queue is full and the backend has to fsync the segment itself instead of
relying on the checkpointer, this will show up in the statistics as an
IOOP_FSYNC for IOPATH_SHARED, not IOPATH_STRATEGY. So IOPATH_STRATEGY +
IOOP_FSYNC will always be 0 for all BackendTypes.
Does this seem right?


>
> It'd be different if we tracked WAL fsyncs more granularly - which would be
> quite interesting - but that's something for another day^Wpatch.
>
>
I do have a question about this.
So, if we were to start tracking WAL IO would it fit within this
paradigm to have a new IOPATH_WAL for WAL or would it add a separate
dimension?

I was thinking that we might want to consider calling this view
pg_stat_io_data because we might want to have a separate view,
pg_stat_io_wal and then, maybe eventually, convert pg_stat_slru to
pg_stat_io_slru (or a subset of what is in pg_stat_slru).
And maybe then later add pg_stat_io_[archiver/other].

Is pg_stat_io_data a good name that gives us flexibility to
introduce views which expose per-backend IO operation stats (maybe that
goes in pg_stat_activity, though [or maybe not because it wouldn't
include exited backends?]) and per query IO operation stats?

I would like to add roughly the same additional columns to all of
these during AIO development (basically the columns from iostat):
- average block size (will usually be 8kB for pg_stat_io_data but won't
necessarily for the others)
- IOPS/BW
- avg read/write wait time
- demand rate/completion rate
- merges
- maybe queue depth

And I would like to be able to see all of these per query, per backend,
per relation, per BackendType, per IOPath, per SLRU type, etc.

Basically, what I'm asking is
1) what can we name the view to enable these future stats to exist with
the least confusing/wordy view names?
2) will the current view layout and column titles work with minimal
changes for future stats extensions like what I mention above?


>
> > > > > Wonder if it's worth making the lock specific to the backend type?
> > > > >
> > > >
> > > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > > can be locked separately. But, I've also kept the lock in
> > > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > > done easily.
> > >
> > > Looks fine about the lock separation.
> > >
> >
> > Actually, I think it is not safe to use both of these locks. So for
> > picking one method, it is probably better to go with the locks in
> > PgStat_IOPathOps, it will be more efficient for flush (and not for
> > fetching and resetting), so that is probably the way to go, right?
>
> I think it's good to just use one kind of lock, and efficiency of
> snapshotting / resetting is nearly irrelevant. But I don't see why it's not
> safe to use both kinds of locks?
>
>
The way I implemented it was not safe because I didn't use both locks
when resetting the stats.

In this new version of the patch, I've done the following: In shared
memory I've put the lock in PgStatShared_IOPathOps -- the data structure
which contains an array of PgStat_IOOpCounters for all IOOp types for
all IOPaths. Thus, different BackendType + IOPath combinations can be
updated concurrently without contending for the same lock.

To make this work, I made two versions of the PgStat_IOPathOps -- one
that has the lock, PgStatShared_IOPathOps, and one without,
PgStat_IOPathOps, so that I can persist it to the stats file without
writing and reading the LWLock and can have a local and snapshot version
of the data structure without the lock.

This also necessitated two versions of the data structure wrapping
PgStat_IOPathOps, PgStat_BackendIOPathOps, which contains an array with
a PgStat_IOPathOps for each BackendType, and
PgStatShared_BackendIOPathOps, containing an array of
PgStatShared_IOPathOps.


>
> > > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > > other pgstat_flush_* functions.
> > >
> > > +       for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > > +               stats_shmem->stats[i].stat_reset_timestamp = ts;
> > >
> > > I'm not sure we need a separate reset timestamp for each backend type
> > > but SLRU counter does the same thing..
> > >
> >
> > Yes, I think for SLRU stats it is because you can reset individual SLRU
> > stats. Also there is no wrapper data structure to put it in. I could
> > keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> > operation stats at once, but I am thinking of getting rid of
> > PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> > locks in PgStat_IOPathOps and make the global shared value an array of
> > PgStat_IOPathOps.
>
> I'm strongly against introducing super granular reset timestamps. I think
> that was a mistake for SLRU stats, but we can't fix that as easily.
>
>
Since all stats in pg_stat_io must be reset at the same time, I've put
the reset timestamp in PgStat[Shared]_BackendIOPathOps and removed it
from each PgStat[Shared]_IOPathOps.


>
> > Currently, strategy allocs count only reuses of a strategy buffer (not
> > initial shared buffers which are added to the ring).
> > strategy writes count only the writing out of dirty buffers which are
> > already in the ring and are being reused.
>
> That seems right to me.
>
>
> > Alternatively, we could also count as strategy allocs all those buffers
> > which are added to the ring and count as strategy writes all those
> > shared buffers which are dirty when initially added to the ring.
>
> I don't think that'd provide valuable information. The whole reason that
> strategy writes are interesting is that they can lead to writing out data a
> lot sooner than they would be written out without a strategy being used.
>
>
Then I agree that strategy writes should only count strategy buffers
that are written out in order to reuse the buffer (which is in lieu of
getting a new, potentially clean, shared buffer). This patch implements
that behavior.

However, for strategy allocs, it seems like we would want to count all
demand for buffers as part of a BufferAccessStrategy. That would include
buffers allocated to initially fill the ring, new shared buffers added
to the ring (once it is already full) because all existing ring buffers
are pinned, and buffers already in the ring which are being reused.

This version of the patch only counts the third scenario as a strategy
allocation, but I think it would make more sense to count all three as
strategy allocs.

The downside of this behavior is that strategy allocs count different
scenarios than strategy writes, reads, and extends. But, I think that
this is okay.

I'll clarify it in the docs once there is a decision.

Also, note that, as stated above, there will never be any strategy
fsyncs (that is, IOPATH_STRATEGY + IOOP_FSYNC will always be 0) because
the code path starting with register_dirty_segment() which ends with a
regular backend doing its own fsync when pendingOps is full does not
know what the current IOPATH is and checkpointer does not use a
BufferAccessStrategy.


>
> > Subject: [PATCH v24 2/3] Track IO operation statistics
> >
> > Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> > location or type of IO done by a backend. For example, the checkpointer
> > may write a shared buffer out. This would be counted as an IOOp write on
> > an IOPath IOPATH_SHARED by BackendType "checkpointer".
>
> I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
> the file path. What about 'origin'?
>
>
Enough has changed in this version of the patch that I decided to defer
renaming until some of the other issues are resolved.


>
> > Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> > (direct, local, shared, or strategy) through a call to
> > pgstat_count_io_op().
>
> It seems we should track reads too - it's quite interesting to know whether
> reads happened because of a strategy, for example. You do reference reads in
> a later part of the commit message even :)
>

I've added reads to what is counted.


>
> > The primary concern of these statistics is IO operations on data blocks
> > during the course of normal database operations. IO done by, for
> > example, the archiver or syslogger is not counted in these statistics.
>
> We could extend this at a later stage, if we really want to. But I'm not sure
> it's interesting or fully possible. E.g. the archiver's writes are largely not
> done by the archiver itself, but by a command (or module these days) it shells
> out to.
>

I've added a note about this to some of the comments and the commit message.
I also omit rows for these BackendTypes from the view. See my later
comment in this email for more detail on that.


>
> > Note that this commit does not add code to increment IOPATH_DIRECT. A
> > future patch adding wrappers for smgrwrite(), smgrextend(), and
> > smgrimmedsync() would provide a good location to call
> > pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> > users of these functions.
>
> Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?
>
>
It's gone.


>
> > Stats on IOOps for all IOPaths for a backend are initially accumulated
> > locally.
> >
> > Later they are flushed to shared memory and accumulated with those from
> > all other backends, exited and live.
>
> Perhaps mention here that this later could be extended to make
> per-connection
> stats visible?
>
>
Mentioned.


>
> > Some BackendTypes will not execute pgstat_report_stat() and thus must
> > explicitly call pgstat_flush_io_ops() in order to flush their backend
> > local IO operation statistics to shared memory.
>
> Maybe add "flush ... during ongoing operation" or such? Because they'd all
> flush at commit, IIRC.
>
>
Added.


>
> > diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
> > index 088556ab54..963b05321e 100644
> > --- a/src/backend/bootstrap/bootstrap.c
> > +++ b/src/backend/bootstrap/bootstrap.c
> > @@ -33,6 +33,7 @@
> >  #include "miscadmin.h"
> >  #include "nodes/makefuncs.h"
> >  #include "pg_getopt.h"
> > +#include "pgstat.h"
> >  #include "storage/bufmgr.h"
> >  #include "storage/bufpage.h"
> >  #include "storage/condition_variable.h"
>
> Hm?
>

Removed


>
> > diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
> > index e926f8c27c..beb46dcb55 100644
> > --- a/src/backend/postmaster/walwriter.c
> > +++ b/src/backend/postmaster/walwriter.c
> > @@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
> >       }
> >
> >       if (ShutdownRequestPending)
> > -     {
> > -             /*
> > -              * Force reporting remaining WAL statistics at process exit.
> > -              *
> > -              * Since pgstat_report_wal is invoked with 'force' is false in main
> > -              * loop to avoid overloading the cumulative stats system, there may
> > -              * exist unreported stats counters for the WAL writer.
> > -              */
> > -             pgstat_report_wal(true);
> > -
> >               proc_exit(0);
> > -     }
> >
> >       /* Perform logging of memory contexts of this process */
> >       if (LogMemoryContextPending)
>
> Let's do this in a separate commit and get it out of the way...
>
>
I've put it in a separate commit.


>
> > @@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
> >   * if this buffer should be written and re-used.
> >   */
> >  bool
> > -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> > +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
> >  {
> > -     /* We only do this in bulkread mode */
> > +
> > +     /*
> > +      * We only reject reusing and writing out the strategy buffer in
> > +      * bulkread mode.
> > +      */
> >       if (strategy->btype != BAS_BULKREAD)
> > +     {
> > +             /*
> > +              * If the buffer was from the ring and we are not rejecting
> > +              * it, consider it a write of a strategy buffer.
> > +              */
> > +             if (strategy->current_was_in_ring)
> > +                     *write_from_ring = true;
>
> Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
> StrategyRejectBuffer() to be called for clean buffers...
>
>
Yes, we do not expect it to be called for clean buffers.
I've added a comment about this assumption.


>
> >  /*
> > diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
> > index d9275611f0..d3963f59d0 100644
> > --- a/src/backend/utils/activity/pgstat_database.c
> > +++ b/src/backend/utils/activity/pgstat_database.c
> > @@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
> >  }
> >
> >  /*
> > - * Called from autovacuum.c to report startup of an autovacuum process.
> > + * Called from autovacuum.c to report startup of an autovacuum process and
> > + * flush IO Operation statistics.
> >   * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
> >   * the db OID must be passed in, instead.
> >   */
> > @@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
> >       dbentry->stats.last_autovac_time = GetCurrentTimestamp();
> >
> >       pgstat_unlock_entry(entry_ref);
> > +
> > +     /*
> > +      * Report IO Operation statistics
> > +      */
> > +     pgstat_flush_io_ops(false);
> >  }
>
> Hm. I suspect this will always be zero - at this point we haven't connected
> to a database, so there really can't have been much, if any, IO. I think I
> suggested doing something here, but on a second look it really doesn't make
> much sense.
>
> Note that that's different from doing something in
> pgstat_report_(vacuum|analyze) - clearly we've done something at that
> point.
>

I've removed this.


>
> >  /*
> > - * Report that the table was just vacuumed.
> > + * Report that the table was just vacuumed and flush IO Operation statistics.
> >   */
> >  void
> >  pgstat_report_vacuum(Oid tableoid, bool shared,
> > @@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> >       }
> >
> >       pgstat_unlock_entry(entry_ref);
> > +
> > +     /*
> > +      * Report IO Operations statistics
> > +      */
> > +     pgstat_flush_io_ops(false);
> >  }
> >
> >  /*
> > - * Report that the table was just analyzed.
> > + * Report that the table was just analyzed and flush IO Operation statistics.
> >   *
> >   * Caller must provide new live- and dead-tuples estimates, as well as a
> >   * flag indicating whether to reset the changes_since_analyze counter.
> > @@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
> >       }
> >
> >       pgstat_unlock_entry(entry_ref);
> > +
> > +     /*
> > +      * Report IO Operations statistics
> > +      */
> > +     pgstat_flush_io_ops(false);
> >  }
>
> Think it'd be good to amend these comments to say that otherwise stats would
> only get flushed after a multi-relation autovacuum cycle is done / a
> VACUUM/ANALYZE command processed all tables.  Perhaps add the comment to one
> of the two functions, and just reference it in the other place?
>

Done


>
>
> > --- a/src/include/utils/backend_status.h
> > +++ b/src/include/utils/backend_status.h
> > @@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
> >                                                           int buflen);
> >  extern uint64 pgstat_get_my_query_id(void);
> >
> > +/* Utility functions */
> > +
> > +/*
> > + * When maintaining an array of information about all valid BackendTypes,
> > + * in order to avoid wasting the 0th spot, use this helper to convert a
> > + * valid BackendType to a valid location in the array (given that no spot
> > + * is maintained for B_INVALID BackendType).
> > + */
> > +static inline int backend_type_get_idx(BackendType backend_type)
> > +{
> > +     /*
> > +      * backend_type must be one of the valid backend types. If caller is
> > +      * maintaining backend information in an array that includes B_INVALID,
> > +      * this function is unnecessary.
> > +      */
> > +     Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
> > +     return backend_type - 1;
> > +}
>
> In function definitions (vs declarations) we put the 'static inline int' in
> a separate line from the rest of the function signature.
>

Fixed.


>
> > +/*
> > + * When using a value from an array of information about all valid
> > + * BackendTypes, add 1 to the index before using it as a BackendType to
> > + * adjust for not maintaining a spot for B_INVALID BackendType.
> > + */
> > +static inline BackendType idx_get_backend_type(int idx)
> > +{
> > +     int backend_type = idx + 1;
> > +     /*
> > +      * If the array includes a spot for B_INVALID BackendType this
> > +      * function is not required.
>
> The comments around this seem a bit over the top, but I also don't mind them
> much.
>

Feel free to change them to something shorter. I couldn't think of
something I liked.


>
>
> > Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
> > writes, fsyncs, and extends) done through each IOPath (e.g. shared
> > buffers, local buffers, unbuffered IO) by each type of backend.
>
> Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
> latter, except that we already have a bunch of views with that prefix.
>
>
I have thoughts on this but thought it best deferred until after the _data
decision.


>
> > Some of these should always be zero. For example, checkpointer does not
> > use a BufferAccessStrategy (currently), so the "strategy" IOPath for
> > checkpointer will be 0 for all IOOps.
>
> What do you think about returning NULL for the values that we expect to never
> be non-zero? Perhaps with an assert against non-zero values? Seems like it
> might be helpful for understanding the view.
>

Yes, I like this idea.

Beyond just setting individual cells to NULL, if an entire row would be
NULL, I have now dropped it from the view.

So far, I have omitted from the view all rows for BackendTypes
B_ARCHIVER, B_LOGGER, and B_STARTUP.

Should I also omit rows for B_WAL_RECEIVER and B_WAL_WRITER for now?

I have also omitted rows for IOPATH_STRATEGY for all BackendTypes
*except* B_AUTOVAC_WORKER, B_BACKEND, B_STANDALONE_BACKEND, and
B_BG_WORKER.

Do these seem correct?

I think there are some BackendTypes which will never do IO Operations on
IOPATH_LOCAL but I am not sure which. Do you know which?

As for individual cells which should be NULL, so far what I have is:
- IOPATH_LOCAL + IOOP_FSYNC
I am sure there are others as well. Can you think of any?


>
> > +/*
> > +* When adding a new column to the pg_stat_io view, add a new enum
> > +* value here above IO_NUM_COLUMNS.
> > +*/
> > +enum
> > +{
> > +     IO_COLUMN_BACKEND_TYPE,
> > +     IO_COLUMN_IO_PATH,
> > +     IO_COLUMN_ALLOCS,
> > +     IO_COLUMN_EXTENDS,
> > +     IO_COLUMN_FSYNCS,
> > +     IO_COLUMN_WRITES,
> > +     IO_COLUMN_RESET_TIME,
> > +     IO_NUM_COLUMNS,
> > +};
>
> We typedef pretty much every enum so the enum can be referenced without the
> 'enum' prefix. I'd do that here, even if we don't need it.
>
>
So, I left it anonymous because I didn't want it being used as a type
or referenced anywhere else.

I am interested to hear more about your SQL enums idea from upthread.

- Melanie
From 5d3e3e702cd95e52cb015a23c0bbeccc5debd46d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Tue, 28 Jun 2022 11:33:04 -0400
Subject: [PATCH v25 1/4] Add BackendType for standalone backends

All backends should have a BackendType to enable statistics reporting
per BackendType.

Add a new BackendType for standalone backends, B_STANDALONE_BACKEND (and
alphabetize the BackendTypes). Both the bootstrap backend and single
user mode backends will have BackendType B_STANDALONE_BACKEND.

Author: Melanie Plageman <melanieplage...@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_aaq33UnG4TXq3S-OSXGWj1QGf0sU%2BECH4tNwGFNERkZA%40mail.gmail.com
---
 src/backend/utils/init/miscinit.c | 17 +++++++++++------
 src/include/miscadmin.h           |  5 +++--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index eb43b2c5e5..07e6db1a1c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
 {
 	Assert(!IsPostmasterEnvironment);
 
+	MyBackendType = B_STANDALONE_BACKEND;
+
 	/*
 	 * Start our win32 signal implementation
 	 */
@@ -255,6 +257,9 @@ GetBackendTypeDesc(BackendType backendType)
 		case B_INVALID:
 			backendDesc = "not initialized";
 			break;
+		case B_ARCHIVER:
+			backendDesc = "archiver";
+			break;
 		case B_AUTOVAC_LAUNCHER:
 			backendDesc = "autovacuum launcher";
 			break;
@@ -273,6 +278,12 @@ GetBackendTypeDesc(BackendType backendType)
 		case B_CHECKPOINTER:
 			backendDesc = "checkpointer";
 			break;
+		case B_LOGGER:
+			backendDesc = "logger";
+			break;
+		case B_STANDALONE_BACKEND:
+			backendDesc = "standalone backend";
+			break;
 		case B_STARTUP:
 			backendDesc = "startup";
 			break;
@@ -285,12 +296,6 @@ GetBackendTypeDesc(BackendType backendType)
 		case B_WAL_WRITER:
 			backendDesc = "walwriter";
 			break;
-		case B_ARCHIVER:
-			backendDesc = "archiver";
-			break;
-		case B_LOGGER:
-			backendDesc = "logger";
-			break;
 	}
 
 	return backendDesc;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ea9a56d395..5276bf25a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -316,18 +316,19 @@ extern void SwitchBackToLocalLatch(void);
 typedef enum BackendType
 {
 	B_INVALID = 0,
+	B_ARCHIVER,
 	B_AUTOVAC_LAUNCHER,
 	B_AUTOVAC_WORKER,
 	B_BACKEND,
 	B_BG_WORKER,
 	B_BG_WRITER,
 	B_CHECKPOINTER,
+	B_LOGGER,
+	B_STANDALONE_BACKEND,
 	B_STARTUP,
 	B_WAL_RECEIVER,
 	B_WAL_SENDER,
 	B_WAL_WRITER,
-	B_ARCHIVER,
-	B_LOGGER,
 } BackendType;
 
 extern PGDLLIMPORT BackendType MyBackendType;
-- 
2.34.1

From 965923536cfe72819b2877e9f1ad4a7e6373b0e8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Tue, 12 Jul 2022 19:53:23 -0400
Subject: [PATCH v25 2/4] Remove unneeded call to pgstat_report_wal()

pgstat_report_stat() will be called before shutdown so an explicit call
to pgstat_report_wal() is wasted.
---
 src/backend/postmaster/walwriter.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index e926f8c27c..beb46dcb55 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
 	}
 
 	if (ShutdownRequestPending)
-	{
-		/*
-		 * Force reporting remaining WAL statistics at process exit.
-		 *
-		 * Since pgstat_report_wal is invoked with 'force' is false in main
-		 * loop to avoid overloading the cumulative stats system, there may
-		 * exist unreported stats counters for the WAL writer.
-		 */
-		pgstat_report_wal(true);
-
 		proc_exit(0);
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
-- 
2.34.1

From 7ba696105c6a45d7b9c7c08fc178d8af4f60c910 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Wed, 29 Jun 2022 18:37:42 -0400
Subject: [PATCH v25 3/4] Track IO operation statistics

Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
location or type of IO done by a backend. For example, the checkpointer
may write a shared buffer out. This would be counted as an IOOp "write"
on an IOPath IOPATH_SHARED by BackendType "checkpointer".

Each IOOp (alloc, extend, fsync, read, write) is counted per IOPath
(local, shared, or strategy) through a call to pgstat_count_io_op().

The primary concern of these statistics is IO operations on data blocks
during the course of normal database operations. IO done by, for
example, the archiver or syslogger is not counted in these statistics.

IOPATH_LOCAL and IOPATH_SHARED IOPaths concern operations on local
and shared buffers.

The IOPATH_STRATEGY IOPath concerns buffers
alloc'd/extended/fsync'd/read/written as part of a BufferAccessStrategy.

IOOP_ALLOC is counted for IOPATH_SHARED and IOPATH_LOCAL whenever a
buffer is acquired through [Local]BufferAlloc(). IOOP_ALLOC for
IOPATH_STRATEGY is counted whenever a buffer already in the strategy
ring is reused. And IOOP_WRITE for IOPATH_STRATEGY is counted whenever
the reused dirty buffer is written out.

Stats on IOOps for all IOPaths for a backend are initially accumulated
locally.

Later they are flushed to shared memory and accumulated with those from
all other backends, exited and live. The accumulated stats in shared
memory could be extended in the future with per-backend stats -- useful
for per connection IO statistics and monitoring.

Some BackendTypes do not flush their pending statistics at regular
intervals, so they explicitly call pgstat_flush_io_ops() during the
course of normal operations to flush their backend-local IO Operation
statistics to shared memory in a timely manner.

Author: Melanie Plageman <melanieplage...@gmail.com>
Reviewed-by: Justin Pryzby <pry...@telsasoft.com>, Kyotaro Horiguchi <horikyota....@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 src/backend/postmaster/checkpointer.c         |   1 +
 src/backend/storage/buffer/bufmgr.c           |  53 ++++-
 src/backend/storage/buffer/freelist.c         |  51 ++++-
 src/backend/storage/buffer/localbuf.c         |   6 +
 src/backend/storage/sync/sync.c               |   2 +
 src/backend/utils/activity/Makefile           |   1 +
 src/backend/utils/activity/pgstat.c           |  36 ++++
 src/backend/utils/activity/pgstat_bgwriter.c  |   7 +-
 .../utils/activity/pgstat_checkpointer.c      |   7 +-
 src/backend/utils/activity/pgstat_io_ops.c    | 192 ++++++++++++++++++
 src/backend/utils/activity/pgstat_relation.c  |  19 +-
 src/backend/utils/activity/pgstat_wal.c       |   4 +-
 src/backend/utils/adt/pgstatfuncs.c           |   4 +-
 src/include/miscadmin.h                       |   2 +
 src/include/pgstat.h                          |  58 ++++++
 src/include/storage/buf_internals.h           |   2 +-
 src/include/utils/backend_status.h            |  36 ++++
 src/include/utils/pgstat_internal.h           |  24 +++
 18 files changed, 485 insertions(+), 20 deletions(-)
 create mode 100644 src/backend/utils/activity/pgstat_io_ops.c

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc14..a06331e1eb 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1116,6 +1116,7 @@ ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
 		if (!AmBackgroundWriterProcess())
 			CheckpointerShmem->num_backend_fsync++;
 		LWLockRelease(CheckpointerCommLock);
+		pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
 		return false;
 	}
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c7d7abcd73..e872d7edc6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,7 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 							   BlockNumber blockNum,
 							   BufferAccessStrategy strategy,
 							   bool *foundPtr);
-static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -813,6 +813,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *bufHdr;
 	Block		bufBlock;
 	bool		found;
+	IOPath io_path;
 	bool		isExtend;
 	bool		isLocalBuf = SmgrIsTemp(smgr);
 
@@ -978,8 +979,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
 
+	if (isLocalBuf)
+		io_path = IOPATH_LOCAL;
+	else if (strategy != NULL)
+		io_path = IOPATH_STRATEGY;
+	else
+		io_path = IOPATH_SHARED;
+
 	if (isExtend)
 	{
+
+		pgstat_count_io_op(IOOP_EXTEND, io_path);
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
 		/* don't set checksum for all-zero page */
@@ -1010,6 +1020,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
 
+			pgstat_count_io_op(IOOP_READ, io_path);
+
 			if (track_io_timing)
 			{
 				INSTR_TIME_SET_CURRENT(io_time);
@@ -1180,6 +1192,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
+		bool write_from_ring = false;
 		/*
 		 * Ensure, while the spinlock's not yet held, that there's a free
 		 * refcount entry.
@@ -1227,6 +1240,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
 										 LW_SHARED))
 			{
+				IOPath iopath;
 				/*
 				 * If using a nondefault strategy, and writing the buffer
 				 * would require a WAL flush, let the strategy decide whether
@@ -1244,7 +1258,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 					UnlockBufHdr(buf, buf_state);
 
 					if (XLogNeedsFlush(lsn) &&
-						StrategyRejectBuffer(strategy, buf))
+						StrategyRejectBuffer(strategy, buf, &write_from_ring))
 					{
 						/* Drop lock/pin and loop around for another buffer */
 						LWLockRelease(BufferDescriptorGetContentLock(buf));
@@ -1253,13 +1267,27 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 					}
 				}
 
+				/*
+				 * When a strategy is in use, if the target dirty buffer is an existing
+				 * strategy buffer being reused, count this as a strategy write for the
+				 * purposes of IO Operations statistics tracking.
+				 *
+				 * All dirty shared buffers upon first being added to the ring will be
+				 * counted as shared buffer writes.
+				 *
+				 * When a strategy is not in use, the write can only be a
+				 * "regular" write of a dirty shared buffer.
+				 */
+
+				iopath = write_from_ring ? IOPATH_STRATEGY : IOPATH_SHARED;
+
 				/* OK, do the I/O */
 				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
 														  smgr->smgr_rlocator.locator.spcOid,
 														  smgr->smgr_rlocator.locator.dbOid,
 														  smgr->smgr_rlocator.locator.relNumber);
 
-				FlushBuffer(buf, NULL);
+				FlushBuffer(buf, NULL, iopath);
 				LWLockRelease(BufferDescriptorGetContentLock(buf));
 
 				ScheduleBufferTagForWriteback(&BackendWritebackContext,
@@ -2563,7 +2591,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	PinBuffer_Locked(bufHdr);
 	LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
 
 	LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 
@@ -2810,9 +2838,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
  * as the second parameter.  If not, pass NULL.
+ *
+ * IOPath will always be IOPATH_SHARED except when a buffer access strategy is
+ * used and the buffer being flushed is a buffer from the strategy ring.
  */
 static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOPath iopath)
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
@@ -2892,6 +2923,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
+	pgstat_count_io_op(IOOP_WRITE, iopath);
+
 	if (track_io_timing)
 		INSTR_TIME_SET_CURRENT(io_start);
 
@@ -3539,6 +3572,8 @@ FlushRelationBuffers(Relation rel)
 						  localpage,
 						  false);
 
+				pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
@@ -3574,7 +3609,7 @@ FlushRelationBuffers(Relation rel)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, RelationGetSmgr(rel));
+			FlushBuffer(bufHdr, RelationGetSmgr(rel), IOPATH_SHARED);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr, true);
 		}
@@ -3669,7 +3704,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, srelent->srel);
+			FlushBuffer(bufHdr, srelent->srel, IOPATH_SHARED);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr, true);
 		}
@@ -3877,7 +3912,7 @@ FlushDatabaseBuffers(Oid dbid)
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-			FlushBuffer(bufHdr, NULL);
+			FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
 			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
 			UnpinBuffer(bufHdr, true);
 		}
@@ -3904,7 +3939,7 @@ FlushOneBuffer(Buffer buffer)
 
 	Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
 
-	FlushBuffer(bufHdr, NULL);
+	FlushBuffer(bufHdr, NULL, IOPATH_SHARED);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 990e081aae..29f5cbeab6 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
@@ -212,8 +213,20 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy, buf_state);
-		if (buf != NULL)
+		if (strategy->current_was_in_ring)
+		{
+			/*
+			 * When a strategy is in use, buffers reused from the strategy
+			 * ring are counted as IOPATH_STRATEGY allocations for the
+			 * purposes of IO Operation statistics tracking.
+			 *
+			 * However, even when a strategy is in use, if a new buffer must
+			 * be allocated from shared buffers and added to the ring, it is
+			 * counted as an IOPATH_SHARED allocation.
+			 */
+			pgstat_count_io_op(IOOP_ALLOC, IOPATH_STRATEGY);
 			return buf;
+		}
 	}
 
 	/*
@@ -247,6 +260,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
 	 * the rate of buffer consumption.  Note that buffers recycled by a
 	 * strategy object are intentionally not counted here.
 	 */
+	pgstat_count_io_op(IOOP_ALLOC, IOPATH_SHARED);
 	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
 
 	/*
@@ -682,16 +696,38 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
  * if this buffer should be written and re-used.
  */
 bool
-StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
+StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
 {
-	/* We only do this in bulkread mode */
+
+	/*
+	 * We only reject reusing and writing out the strategy buffer in bulkread
+	 * mode.
+	 */
 	if (strategy->btype != BAS_BULKREAD)
+	{
+		/*
+		 * If the buffer was from the ring and we are not rejecting it, consider it
+		 * a write of a strategy buffer. Note that this assumes that the buffer is
+		 * dirty.
+		 */
+		if (strategy->current_was_in_ring)
+			*write_from_ring = true;
 		return false;
+	}
 
-	/* Don't muck with behavior of normal buffer-replacement strategy */
+	/*
+	 * Don't muck with behavior of normal buffer-replacement strategy. Though
+	 * this buffer is not rejected, leave write_from_ring false: shared
+	 * buffers newly added to the ring -- either initially or because all
+	 * existing strategy buffers were pinned -- are not counted as strategy
+	 * writes for the purposes of IO Operation statistics.
+	 */
 	if (!strategy->current_was_in_ring ||
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
+	{
+		*write_from_ring = false;
 		return false;
+	}
 
 	/*
 	 * Remove the dirty buffer from the ring; necessary to prevent infinite
@@ -699,5 +735,12 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
 
+	/*
+	 * Since the buffer is being rejected, it will not be written out and the
+	 * caller should not consult this flag (which should have been initialized
+	 * to false anyway). Set it here anyway for clarity.
+	 */
+	*write_from_ring = false;
+
 	return true;
 }
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 9c038851d7..edd3296dd7 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
 #include "access/parallel.h"
 #include "catalog/catalog.h"
 #include "executor/instrument.h"
+#include "pgstat.h"
@@ -196,6 +197,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 				LocalRefCount[b]++;
 				ResourceOwnerRememberBuffer(CurrentResourceOwner,
 											BufferDescriptorGetBuffer(bufHdr));
+
+				pgstat_count_io_op(IOOP_ALLOC, IOPATH_LOCAL);
 				break;
 			}
 		}
@@ -226,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 				  localpage,
 				  false);
 
+		pgstat_count_io_op(IOOP_WRITE, IOPATH_LOCAL);
+
 		/* Mark not-dirty now in case we error out below */
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e1fb631003..20e259edef 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -432,6 +432,8 @@ ProcessSyncRequests(void)
 					total_elapsed += elapsed;
 					processed++;
 
+					pgstat_count_io_op(IOOP_FSYNC, IOPATH_SHARED);
+
 					if (log_checkpoints)
 						elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f ms",
 							 processed,
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index a2e8507fd6..0098785089 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	pgstat_checkpointer.o \
 	pgstat_database.o \
 	pgstat_function.o \
+	pgstat_io_ops.o \
 	pgstat_relation.o \
 	pgstat_replslot.o \
 	pgstat_shmem.o \
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 88e5dd1b2b..3238d9ba85 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -359,6 +359,15 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
 		.snapshot_cb = pgstat_checkpointer_snapshot_cb,
 	},
 
+	[PGSTAT_KIND_IOOPS] = {
+		.name = "io_ops",
+
+		.fixed_amount = true,
+
+		.reset_all_cb = pgstat_io_ops_reset_all_cb,
+		.snapshot_cb = pgstat_io_ops_snapshot_cb,
+	},
+
 	[PGSTAT_KIND_SLRU] = {
 		.name = "slru",
 
@@ -628,6 +637,9 @@ pgstat_report_stat(bool force)
 	/* flush database / relation / function / ... stats */
 	partial_flush |= pgstat_flush_pending_entries(nowait);
 
+	/* flush IO Operations stats */
+	partial_flush |= pgstat_flush_io_ops(nowait);
+
 	/* flush wal stats */
 	partial_flush |= pgstat_flush_wal(nowait);
 
@@ -1312,6 +1324,12 @@ pgstat_write_statsfile(void)
 	pgstat_build_snapshot_fixed(PGSTAT_KIND_CHECKPOINTER);
 	write_chunk_s(fpout, &pgStatLocal.snapshot.checkpointer);
 
+	/*
+	 * Write IO Operations stats struct
+	 */
+	pgstat_build_snapshot_fixed(PGSTAT_KIND_IOOPS);
+	write_chunk_s(fpout, &pgStatLocal.snapshot.io_ops);
+
 	/*
 	 * Write SLRU stats struct
 	 */
@@ -1427,8 +1445,10 @@ pgstat_read_statsfile(void)
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
+	PgStat_BackendIOPathOps io_stats;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	PgStatShared_BackendIOPathOps *io_stats_shmem = &shmem->io_ops;
 
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1486,6 +1506,22 @@ pgstat_read_statsfile(void)
 	if (!read_chunk_s(fpin, &shmem->checkpointer.stats))
 		goto error;
 
+	/*
+	 * Read IO Operations stats struct
+	 */
+	if (!read_chunk_s(fpin, &io_stats))
+		goto error;
+
+	io_stats_shmem->stat_reset_timestamp = io_stats.stat_reset_timestamp;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStat_IOPathOps *stats = &io_stats.stats[i];
+		PgStatShared_IOPathOps *stats_shmem = &io_stats_shmem->stats[i];
+
+		memcpy(stats_shmem->data, stats->data, sizeof(stats->data));
+	}
+
 	/*
 	 * Read SLRU stats struct
 	 */
diff --git a/src/backend/utils/activity/pgstat_bgwriter.c b/src/backend/utils/activity/pgstat_bgwriter.c
index fbb1edc527..3d7f90a1b7 100644
--- a/src/backend/utils/activity/pgstat_bgwriter.c
+++ b/src/backend/utils/activity/pgstat_bgwriter.c
@@ -24,7 +24,7 @@ PgStat_BgWriterStats PendingBgWriterStats = {0};
 
 
 /*
- * Report bgwriter statistics
+ * Report bgwriter and IO Operation statistics
  */
 void
 pgstat_report_bgwriter(void)
@@ -56,6 +56,11 @@ pgstat_report_bgwriter(void)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
+
+	/*
+	 * Report IO Operation statistics
+	 */
+	pgstat_flush_io_ops(false);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_checkpointer.c b/src/backend/utils/activity/pgstat_checkpointer.c
index af8d513e7b..cfcf127210 100644
--- a/src/backend/utils/activity/pgstat_checkpointer.c
+++ b/src/backend/utils/activity/pgstat_checkpointer.c
@@ -24,7 +24,7 @@ PgStat_CheckpointerStats PendingCheckpointerStats = {0};
 
 
 /*
- * Report checkpointer statistics
+ * Report checkpointer and IO Operation statistics
  */
 void
 pgstat_report_checkpointer(void)
@@ -62,6 +62,11 @@ pgstat_report_checkpointer(void)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&PendingCheckpointerStats, 0, sizeof(PendingCheckpointerStats));
+
+	/*
+	 * Report IO Operation statistics
+	 */
+	pgstat_flush_io_ops(false);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_io_ops.c b/src/backend/utils/activity/pgstat_io_ops.c
new file mode 100644
index 0000000000..6e7351660f
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_io_ops.c
@@ -0,0 +1,192 @@
+/* -------------------------------------------------------------------------
+ *
+ * pgstat_io_ops.c
+ *	  Implementation of IO operation statistics.
+ *
+ * This file contains the implementation of IO operation statistics. It is kept
+ * separate from pgstat.c to enforce the line between the statistics access /
+ * storage implementation and the details about individual types of
+ * statistics.
+ *
+ * Copyright (c) 2001-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/activity/pgstat_io_ops.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_internal.h"
+
+static PgStat_IOPathOps pending_IOOpStats;
+static bool have_ioopstats = false;
+
+
+/*
+ * Flush out locally pending IO Operation statistics entries.
+ *
+ * If nowait is true and the lock on the shared statistics could not be
+ * acquired, return true and leave the pending entries intact.
+ *
+ * Otherwise, accumulate the pending entries into the shared statistics,
+ * reset them, and return false.
+ */
+bool
+pgstat_flush_io_ops(bool nowait)
+{
+	PgStatShared_IOPathOps *stats_shmem;
+
+	if (!have_ioopstats)
+		return false;
+
+	stats_shmem =
+		&pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];
+
+	if (!nowait)
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE))
+		return true;
+
+
+	for (int i = 0; i < IOPATH_NUM_TYPES; i++)
+	{
+		PgStat_IOOpCounters *sharedent = &stats_shmem->data[i];
+		PgStat_IOOpCounters *pendingent = &pending_IOOpStats.data[i];
+
+#define IO_OP_ACC(fld) sharedent->fld += pendingent->fld
+		IO_OP_ACC(allocs);
+		IO_OP_ACC(extends);
+		IO_OP_ACC(fsyncs);
+		IO_OP_ACC(reads);
+		IO_OP_ACC(writes);
+#undef IO_OP_ACC
+	}
+
+	LWLockRelease(&stats_shmem->lock);
+
+	memset(&pending_IOOpStats, 0, sizeof(pending_IOOpStats));
+
+	have_ioopstats = false;
+
+	return false;
+}
+
+void
+pgstat_io_ops_snapshot_cb(void)
+{
+	PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+	PgStat_BackendIOPathOps *all_backend_stats_snap = &pgStatLocal.snapshot.io_ops;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+		PgStat_IOPathOps *stats_snap = &all_backend_stats_snap->stats[i];
+
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+		/*
+		 * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
+		 * reset timestamp as well.
+		 */
+		if (i == 0)
+			all_backend_stats_snap->stat_reset_timestamp = all_backend_stats_shmem->stat_reset_timestamp;
+
+		memcpy(stats_snap->data, stats_shmem->data, sizeof(stats_shmem->data));
+		LWLockRelease(&stats_shmem->lock);
+	}
+
+}
+
+void
+pgstat_io_ops_reset_all_cb(TimestampTz ts)
+{
+	PgStatShared_BackendIOPathOps *all_backend_stats_shmem = &pgStatLocal.shmem->io_ops;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		PgStatShared_IOPathOps *stats_shmem = &all_backend_stats_shmem->stats[i];
+
+		LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+
+		/*
+		 * Use the lock in the first BackendType's PgStat_IOPathOps to protect the
+		 * reset timestamp as well.
+		 */
+		if (i == 0)
+			all_backend_stats_shmem->stat_reset_timestamp = ts;
+
+		memset(stats_shmem->data, 0, sizeof(stats_shmem->data));
+		LWLockRelease(&stats_shmem->lock);
+	}
+}
+
+void
+pgstat_count_io_op(IOOp io_op, IOPath io_path)
+{
+	PgStat_IOOpCounters *pending_counters = &pending_IOOpStats.data[io_path];
+
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			pending_counters->allocs++;
+			break;
+		case IOOP_EXTEND:
+			pending_counters->extends++;
+			break;
+		case IOOP_FSYNC:
+			pending_counters->fsyncs++;
+			break;
+		case IOOP_READ:
+			pending_counters->reads++;
+			break;
+		case IOOP_WRITE:
+			pending_counters->writes++;
+			break;
+	}
+
+	have_ioopstats = true;
+}
+
+PgStat_BackendIOPathOps*
+pgstat_fetch_backend_io_path_ops(void)
+{
+	pgstat_snapshot_fixed(PGSTAT_KIND_IOOPS);
+
+	return &pgStatLocal.snapshot.io_ops;
+}
+
+const char *
+pgstat_io_path_desc(IOPath io_path)
+{
+	switch (io_path)
+	{
+		case IOPATH_LOCAL:
+			return "Local";
+		case IOPATH_SHARED:
+			return "Shared";
+		case IOPATH_STRATEGY:
+			return "Strategy";
+	}
+
+	elog(ERROR, "unrecognized IOPath value: %d", io_path);
+}
+
+const char *
+pgstat_io_op_desc(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_ALLOC:
+			return "Alloc";
+		case IOOP_EXTEND:
+			return "Extend";
+		case IOOP_FSYNC:
+			return "Fsync";
+		case IOOP_READ:
+			return "Read";
+		case IOOP_WRITE:
+			return "Write";
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+}
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb6..a17b3336db 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,7 +205,7 @@ pgstat_drop_relation(Relation rel)
 }
 
 /*
- * Report that the table was just vacuumed.
+ * Report that the table was just vacuumed and flush IO Operation statistics.
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
@@ -257,10 +257,18 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	}
 
 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Flush IO Operations statistics now. pgstat_report_stat() would flush
+	 * them as well, but it is not called until an entire autovacuum cycle
+	 * finishes -- likely having vacuumed many relations -- or until the
+	 * VACUUM command has processed all tables and committed.
+	 */
+	pgstat_flush_io_ops(false);
 }
 
 /*
- * Report that the table was just analyzed.
+ * Report that the table was just analyzed and flush IO Operation statistics.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -340,6 +348,13 @@ pgstat_report_analyze(Relation rel,
 	}
 
 	pgstat_unlock_entry(entry_ref);
+
+	/*
+	 * Flush IO Operations statistics explicitly for the same reason as in
+	 * pgstat_report_vacuum(). We don't want to wait for an entire ANALYZE
+	 * command to complete before updating stats.
+	 */
+	pgstat_flush_io_ops(false);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 5a878bd115..9cac407b42 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -34,7 +34,7 @@ static WalUsage prevWalUsage;
 
 /*
  * Calculate how much WAL usage counters have increased and update
- * shared statistics.
+ * shared WAL and IO Operation statistics.
  *
  * Must be called by processes that generate WAL, that do not call
  * pgstat_report_stat(), like walwriter.
@@ -43,6 +43,8 @@ void
 pgstat_report_wal(bool force)
 {
 	pgstat_flush_wal(force);
+
+	pgstat_flush_io_ops(force);
 }
 
 /*
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad5..6259cc4f4c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -2104,6 +2104,8 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
 		pgstat_reset_of_kind(PGSTAT_KIND_BGWRITER);
 		pgstat_reset_of_kind(PGSTAT_KIND_CHECKPOINTER);
 	}
+	else if (strcmp(target, "io") == 0)
+		pgstat_reset_of_kind(PGSTAT_KIND_IOOPS);
 	else if (strcmp(target, "recovery_prefetch") == 0)
 		XLogPrefetchResetStats();
 	else if (strcmp(target, "wal") == 0)
@@ -2112,7 +2114,7 @@ pg_stat_reset_shared(PG_FUNCTION_ARGS)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("unrecognized reset target: \"%s\"", target),
-				 errhint("Target must be \"archiver\", \"bgwriter\", \"recovery_prefetch\", or \"wal\".")));
+				 errhint("Target must be \"archiver\", \"bgwriter\", \"io\", \"recovery_prefetch\", or \"wal\".")));
 
 	PG_RETURN_VOID();
 }
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 5276bf25a1..61e95135f2 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -331,6 +331,12 @@ typedef enum BackendType
 	B_WAL_WRITER,
 } BackendType;
 
+/*
+ * The number of valid backend types: B_INVALID is 0 and excluded, so this
+ * equals the value of the last enum member.
+ */
+#define BACKEND_NUM_TYPES B_WAL_WRITER
+
 extern PGDLLIMPORT BackendType MyBackendType;
 
 extern const char *GetBackendTypeDesc(BackendType backendType);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac28f813b4..d6ed6ec864 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -14,6 +14,7 @@
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
+#include "storage/lwlock.h"
 #include "utils/backend_progress.h" /* for backward compatibility */
 #include "utils/backend_status.h"	/* for backward compatibility */
 #include "utils/relcache.h"
@@ -48,6 +49,7 @@ typedef enum PgStat_Kind
 	PGSTAT_KIND_ARCHIVER,
 	PGSTAT_KIND_BGWRITER,
 	PGSTAT_KIND_CHECKPOINTER,
+	PGSTAT_KIND_IOOPS,
 	PGSTAT_KIND_SLRU,
 	PGSTAT_KIND_WAL,
 } PgStat_Kind;
@@ -276,6 +278,50 @@ typedef struct PgStat_CheckpointerStats
 	PgStat_Counter buf_fsync_backend;
 } PgStat_CheckpointerStats;
 
+/*
+ * Types related to counting IO Operations for various IO Paths
+ */
+
+typedef enum IOOp
+{
+	IOOP_ALLOC,
+	IOOP_EXTEND,
+	IOOP_FSYNC,
+	IOOP_READ,
+	IOOP_WRITE,
+} IOOp;
+
+#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+
+typedef enum IOPath
+{
+	IOPATH_LOCAL,
+	IOPATH_SHARED,
+	IOPATH_STRATEGY,
+} IOPath;
+
+#define IOPATH_NUM_TYPES (IOPATH_STRATEGY + 1)
+
+typedef struct PgStat_IOOpCounters
+{
+	PgStat_Counter allocs;
+	PgStat_Counter extends;
+	PgStat_Counter fsyncs;
+	PgStat_Counter reads;
+	PgStat_Counter writes;
+} PgStat_IOOpCounters;
+
+typedef struct PgStat_IOPathOps
+{
+	PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+} PgStat_IOPathOps;
+
+typedef struct PgStat_BackendIOPathOps
+{
+	TimestampTz stat_reset_timestamp;
+	PgStat_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStat_BackendIOPathOps;
+
 typedef struct PgStat_StatDBEntry
 {
 	PgStat_Counter n_xact_commit;
@@ -453,6 +499,18 @@ extern void pgstat_report_checkpointer(void);
 extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 
 
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_count_io_op(IOOp io_op, IOPath io_path);
+extern bool pgstat_flush_io_ops(bool nowait);
+extern PgStat_BackendIOPathOps *pgstat_fetch_backend_io_path_ops(void);
+extern PgStat_Counter pgstat_fetch_cumulative_io_ops(IOPath io_path, IOOp io_op);
+extern const char *pgstat_io_op_desc(IOOp io_op);
+extern const char *pgstat_io_path_desc(IOPath io_path);
+
+
 /*
  * Functions in pgstat_database.c
  */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 69e45900ba..b69c5f7e3c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -313,7 +313,7 @@ extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state);
 extern void StrategyFreeBuffer(BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
-								 BufferDesc *buf);
+								 BufferDesc *buf, bool *write_from_ring);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
 extern void StrategyNotifyBgWriter(int bgwprocno);
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25e..d9b6d12acc 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -306,6 +306,42 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
 													   int buflen);
 extern uint64 pgstat_get_my_query_id(void);
 
+/* Utility functions */
+
+/*
+ * When maintaining an array of information about all valid BackendTypes,
+ * use this helper to convert a valid BackendType to an index into that
+ * array, avoiding a wasted 0th spot (no spot is maintained for the
+ * B_INVALID BackendType).
+ */
+static inline int
+backend_type_get_idx(BackendType backend_type)
+{
+	/*
+	 * backend_type must be one of the valid backend types. If caller is
+	 * maintaining backend information in an array that includes B_INVALID,
+	 * this function is unnecessary.
+	 */
+	Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+	return backend_type - 1;
+}
+
+/*
+ * When using a value from an array of information about all valid
+ * BackendTypes, add 1 to the index before using it as a BackendType to adjust
+ * for not maintaining a spot for B_INVALID BackendType.
+ */
+static inline BackendType
+idx_get_backend_type(int idx)
+{
+	int backend_type = idx + 1;
+	/*
+	 * If the array includes a spot for B_INVALID BackendType this function is
+	 * not required.
+	 */
+	Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
+	return backend_type;
+}
 
 /* ----------
  * Support functions for the SQL-callable functions to
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 9303d05427..3151c43dfe 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -329,6 +329,19 @@ typedef struct PgStatShared_Checkpointer
 	PgStat_CheckpointerStats reset_offset;
 } PgStatShared_Checkpointer;
 
+typedef struct PgStatShared_IOPathOps
+{
+	LWLock		lock;
+	PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
+} PgStatShared_IOPathOps;
+
+typedef struct PgStatShared_BackendIOPathOps
+{
+	TimestampTz stat_reset_timestamp;
+	PgStatShared_IOPathOps stats[BACKEND_NUM_TYPES];
+} PgStatShared_BackendIOPathOps;
+
+
 typedef struct PgStatShared_SLRU
 {
 	/* lock protects ->stats */
@@ -419,6 +432,7 @@ typedef struct PgStat_ShmemControl
 	PgStatShared_Archiver archiver;
 	PgStatShared_BgWriter bgwriter;
 	PgStatShared_Checkpointer checkpointer;
+	PgStatShared_BackendIOPathOps io_ops;
 	PgStatShared_SLRU slru;
 	PgStatShared_Wal wal;
 } PgStat_ShmemControl;
@@ -442,6 +456,8 @@ typedef struct PgStat_Snapshot
 
 	PgStat_CheckpointerStats checkpointer;
 
+	PgStat_BackendIOPathOps io_ops;
+
 	PgStat_SLRUStats slru[SLRU_NUM_ELEMENTS];
 
 	PgStat_WalStats wal;
@@ -549,6 +565,14 @@ extern void pgstat_database_reset_timestamp_cb(PgStatShared_Common *header, Time
 extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
 
 
+/*
+ * Functions in pgstat_io_ops.c
+ */
+
+extern void pgstat_io_ops_snapshot_cb(void);
+extern void pgstat_io_ops_reset_all_cb(TimestampTz ts);
+
+
 /*
  * Functions in pgstat_relation.c
  */
-- 
2.34.1

From f1dd9c1ccddce6ee4cad4df70f3475ac2a83bca3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Mon, 4 Jul 2022 15:44:17 -0400
Subject: [PATCH v25 4/4] Add system view tracking IO ops per backend type

Add pg_stat_io, a system view which tracks the number of IOOps (allocs,
extends, fsyncs, reads, and writes) performed through each IOPath (shared
buffers, local buffers, strategy buffers) by each type of backend (e.g.
client backend, checkpointer).

Some IOPaths are not used by some BackendTypes and will not be in the
view. For example, checkpointer does not use a BufferAccessStrategy
(currently), so there will be no row for the "strategy" IOPath for
checkpointer.

Some IOOps are invalid in combination with certain IOPaths. Those cells
will be NULL in the view. For example, local buffers are not fsync'd so
cells for all BackendTypes for IOPATH_LOCAL and IOOP_FSYNC will be
NULL.

View stats are fetched from statistics incremented when a backend
performs an IO Operation and maintained by the cumulative statistics
subsystem.

Each row of the view is stats for a particular BackendType for a
particular IOPath (e.g. shared buffer accesses by checkpointer) and
each column in the view is the total number of IO Operations done (e.g.
writes).
So a cell in the view would be, for example, the number of shared
buffers written by checkpointer since the last stats reset.
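
As a sketch of how such a cell would be read (the column names match the
view definition in this patch; the io_path display values follow
pgstat_io_path_desc(), and 'checkpointer' follows GetBackendTypeDesc()):

```sql
-- Shared-buffer writes by the checkpointer since the last stats reset
SELECT backend_type, io_path, write
FROM pg_stat_io
WHERE backend_type = 'checkpointer'
  AND io_path = 'Shared';
```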

Note that some of the cells in the view are redundant with fields in
pg_stat_bgwriter (e.g. buffers_backend), however these have been kept in
pg_stat_bgwriter for backwards compatibility. Deriving the redundant
pg_stat_bgwriter stats from the IO operations stats structures was also
problematic due to the separate reset targets for 'bgwriter' and
'io'.

Suggested by Andres Freund

Author: Melanie Plageman <melanieplage...@gmail.com>
Reviewed-by: Justin Pryzby <pry...@telsasoft.com>, Kyotaro Horiguchi <horikyota....@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
---
 doc/src/sgml/monitoring.sgml         | 117 ++++++++++++++++++++++++++-
 src/backend/catalog/system_views.sql |  12 +++
 src/backend/utils/adt/pgstatfuncs.c  | 106 ++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat      |   9 +++
 src/test/regress/expected/rules.out  |   9 +++
 src/test/regress/expected/stats.out  |  59 ++++++++++++++
 src/test/regress/sql/stats.sql       |  34 ++++++++
 7 files changed, 345 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4549c2560e..2b0ee495ee 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_io</structname><indexterm><primary>pg_stat_io</primary></indexterm></entry>
+      <entry>A row for each IO path for each backend type showing
+      statistics about backend IO operations. See
+       <link linkend="monitoring-pg-stat-io-view">
+       <structname>pg_stat_io</structname></link> for details.
+     </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_wal</structname><indexterm><primary>pg_stat_wal</primary></indexterm></entry>
       <entry>One row only, showing statistics about WAL activity. See
@@ -3595,7 +3604,111 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
       </para>
       <para>
-       Time at which these statistics were last reset
+       Time at which these statistics were last reset.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+ </sect2>
+
+ <sect2 id="monitoring-pg-stat-io-view">
+  <title><structname>pg_stat_io</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_io</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_io</structname> view contains one row for each
+   backend type for each possible IO path, showing cluster-wide IO operation
+   statistics for that combination.
+  </para>
+
+  <table id="pg-stat-io-view" xreflabel="pg_stat_io">
+   <title><structname>pg_stat_io</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>backend_type</structfield> <type>text</type>
+      </para>
+      <para>
+       Type of backend (e.g. background worker, autovacuum worker).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>io_path</structfield> <type>text</type>
+      </para>
+      <para>
+       IO path taken (e.g. shared buffers, local buffers, strategy buffers).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>alloc</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of buffers allocated.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>extend</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks extended.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>fsync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of fsync calls issued.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks read.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of blocks written.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+      </para>
+      <para>
+       Time at which these statistics were last reset.
       </para></entry>
      </row>
     </tbody>
@@ -5355,6 +5468,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         the <structname>pg_stat_bgwriter</structname>
         view, <literal>archiver</literal> to reset all the counters shown in
         the <structname>pg_stat_archiver</structname> view,
+        <literal>io</literal> to reset all the counters shown in the
+        <structname>pg_stat_io</structname> view,
         <literal>wal</literal> to reset all the counters shown in the
         <structname>pg_stat_wal</structname> view or
         <literal>recovery_prefetch</literal> to reset all the counters shown
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fedaed533b..1fe3b07daa 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1115,6 +1115,18 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_io AS
+SELECT
+       b.backend_type,
+       b.io_path,
+       b.alloc,
+       b.extend,
+       b.fsync,
+       b.read,
+       b.write,
+       b.stats_reset
+FROM pg_stat_get_io() b;
+
 CREATE VIEW pg_stat_wal AS
     SELECT
         w.wal_records,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6259cc4f4c..21d54ec9b1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1739,6 +1739,112 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 	PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
+/*
+ * When adding a new column to the pg_stat_io view, add a new enum value
+ * here above IO_NUM_COLUMNS.
+ */
+enum
+{
+	IO_COLUMN_BACKEND_TYPE,
+	IO_COLUMN_IO_PATH,
+	IO_COLUMN_ALLOCS,
+	IO_COLUMN_EXTENDS,
+	IO_COLUMN_FSYNCS,
+	IO_COLUMN_READS,
+	IO_COLUMN_WRITES,
+	IO_COLUMN_RESET_TIME,
+	IO_NUM_COLUMNS,
+};
+
+Datum
+pg_stat_get_io(PG_FUNCTION_ARGS)
+{
+	PgStat_BackendIOPathOps *io_stats;
+	PgStat_IOPathOps *io_path_ops;
+	ReturnSetInfo *rsinfo;
+	Datum reset_time;
+
+	SetSingleFuncCall(fcinfo, 0);
+	io_stats = pgstat_fetch_backend_io_path_ops();
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	/*
+	 * Currently it is not permitted to reset IO operation stats for
+	 * individual IO paths or individual backend types; all IO operation
+	 * statistics are reset together. As such, it is easiest to reuse the
+	 * first reset timestamp available.
+	 */
+	reset_time = TimestampTzGetDatum(io_stats->stat_reset_timestamp);
+
+	io_path_ops = io_stats->stats;
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		bool can_use_strategy;
+		PgStat_IOOpCounters *counters = io_path_ops->data;
+		BackendType backend_type = idx_get_backend_type(i);
+
+		/*
+		 * IO operation statistics are not collected for all backend types.
+		 * Backend types without IO operation stats are skipped in the view
+		 * altogether.
+		 *
+		 * The following backend types do not participate in the cumulative
+		 * stats subsystem or do not perform IO operations worth reporting
+		 * statistics on:
+		 * - Startup process, because it does not have relation OIDs
+		 * - Syslogger, because it is not connected to shared memory
+		 * - Archiver, because most relevant archiving IO is delegated to a
+		 *   specialized command or module
+		 */
+		if (backend_type == B_ARCHIVER || backend_type == B_LOGGER ||
+			backend_type == B_STARTUP)
+			continue;
+
+		/*
+		 * Not all BackendTypes will use a BufferAccessStrategy. Omit those rows
+		 * from the view.
+		 */
+		can_use_strategy = (backend_type == B_AUTOVAC_WORKER ||
+							backend_type == B_BACKEND ||
+							backend_type == B_STANDALONE_BACKEND ||
+							backend_type == B_BG_WORKER);
+
+		for (int j = 0; j < IOPATH_NUM_TYPES; j++)
+		{
+			Datum values[IO_NUM_COLUMNS];
+			bool nulls[IO_NUM_COLUMNS];
+
+			if (j == IOPATH_STRATEGY && !can_use_strategy)
+				continue;
+
+			memset(values, 0, sizeof(values));
+			memset(nulls, 0, sizeof(nulls));
+
+			values[IO_COLUMN_BACKEND_TYPE] = CStringGetTextDatum(GetBackendTypeDesc(backend_type));
+			values[IO_COLUMN_IO_PATH] = CStringGetTextDatum(pgstat_io_path_desc(j));
+			values[IO_COLUMN_RESET_TIME] = reset_time;
+			values[IO_COLUMN_ALLOCS] = Int64GetDatum(counters->allocs);
+			values[IO_COLUMN_EXTENDS] = Int64GetDatum(counters->extends);
+			values[IO_COLUMN_FSYNCS] = Int64GetDatum(counters->fsyncs);
+			values[IO_COLUMN_READS] = Int64GetDatum(counters->reads);
+			values[IO_COLUMN_WRITES] = Int64GetDatum(counters->writes);
+
+			/*
+			 * Temporary tables using local buffers are not WAL-logged and
+			 * thus do not require fsyncing. Set this cell to NULL to
+			 * differentiate between an invalid combination and 0 observed
+			 * IO operations.
+			 */
+			if (j == IOPATH_LOCAL)
+				nulls[IO_COLUMN_FSYNCS] = true;
+
+			tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+			counters++;
+		}
+
+		io_path_ops++;
+	}
+
+	return (Datum) 0;
+}
+
 /*
  * Returns statistics of WAL activity
  */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..bec3c93991 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5646,6 +5646,15 @@
   proname => 'pg_stat_get_buf_alloc', provolatile => 's', proparallel => 'r',
   prorettype => 'int8', proargtypes => '', prosrc => 'pg_stat_get_buf_alloc' },
 
+{ oid => '8459', descr => 'statistics: counts of all IO operations done through all IO paths by each type of backend',
+  proname => 'pg_stat_get_io', provolatile => 's', proisstrict => 'f',
+  prorows => '52', proretset => 't',
+  proparallel => 'r', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{text,text,int8,int8,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_path,alloc,extend,fsync,read,write,stats_reset}',
+  prosrc => 'pg_stat_get_io' },
+
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7ec3d2688f..2b269e005e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1873,6 +1873,15 @@ pg_stat_gssapi| SELECT s.pid,
     s.gss_enc AS encrypted
    FROM pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
   WHERE (s.client_port IS NOT NULL);
+pg_stat_io| SELECT b.backend_type,
+    b.io_path,
+    b.alloc,
+    b.extend,
+    b.fsync,
+    b.read,
+    b.write,
+    b.stats_reset
+   FROM pg_stat_get_io() b(backend_type, io_path, alloc, extend, fsync, read, write, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 5b0ebf090f..6dade03b65 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -554,4 +554,63 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
 
 DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
 DROP TABLE prevstats;
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_shared_writes;
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i FROM generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush 
+--------------------------
+ 
+(1 row)
+
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+ ?column? 
+----------
+ t
+(1 row)
+
+DROP TABLE test_io_strategy_stats;
 -- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 3f3cf8fb56..fbd3977605 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -285,4 +285,38 @@ SELECT pg_stat_get_live_tuples(:drop_stats_test_subxact_oid);
 
 DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
 DROP TABLE prevstats;
+
+-- Test that writes to Shared Buffers are tracked in pg_stat_io
+SELECT sum(write) AS io_sum_shared_writes_before FROM pg_stat_io WHERE io_path = 'Shared' \gset
+CREATE TABLE test_io_shared_writes(a int);
+INSERT INTO test_io_shared_writes SELECT i FROM generate_series(1,100)i;
+CHECKPOINT;
+SELECT pg_stat_force_next_flush();
+SELECT sum(write) AS io_sum_shared_writes_after FROM pg_stat_io WHERE io_path = 'Shared' \gset
+SELECT :io_sum_shared_writes_after > :io_sum_shared_writes_before;
+DROP TABLE test_io_shared_writes;
+
+-- Test that extends of temporary tables are tracked in pg_stat_io
+CREATE TEMPORARY TABLE test_io_local_extends(a int);
+SELECT sum(extend) AS io_sum_local_extends_before FROM pg_stat_io WHERE io_path = 'Local' \gset
+INSERT INTO test_io_local_extends VALUES(1);
+SELECT pg_stat_force_next_flush();
+SELECT sum(extend) AS io_sum_local_extends_after FROM pg_stat_io WHERE io_path = 'Local' \gset
+SELECT :io_sum_local_extends_after > :io_sum_local_extends_before;
+
+-- Test that, when using a Strategy, reusing buffers from the Strategy ring
+-- counts as "Strategy" allocs.
+CREATE TABLE test_io_strategy_stats(a INT, b INT);
+ALTER TABLE test_io_strategy_stats SET (autovacuum_enabled = 'false');
+INSERT INTO test_io_strategy_stats SELECT i, i FROM generate_series(1,8000)i;
+-- Ensure that the next VACUUM will need to perform IO
+VACUUM (FULL) test_io_strategy_stats;
+SELECT sum(alloc) AS io_sum_strategy_allocs_before FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+VACUUM (PARALLEL 0) test_io_strategy_stats;
+SELECT pg_stat_force_next_flush();
+SELECT sum(alloc) AS io_sum_strategy_allocs_after FROM pg_stat_io WHERE io_path = 'Strategy' \gset
+SELECT :io_sum_strategy_allocs_after > :io_sum_strategy_allocs_before;
+DROP TABLE test_io_strategy_stats;
+
 -- End of Stats Test
-- 
2.34.1
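
For anyone kicking the tires on this patch set, a query along these lines surfaces the new counters. This is only a sketch: it assumes a server built with the patch applied, and uses the 'Shared' io_path label and the column names defined in the view above.

```sql
-- Per-backend-type IO through shared buffers, busiest writers first.
SELECT backend_type, alloc, read, write, fsync, stats_reset
FROM pg_stat_io
WHERE io_path = 'Shared'
ORDER BY write DESC;
```

Note that fsync will be NULL for the 'Local' io_path, per the patch, so sums over fsync should either filter on io_path or rely on sum() ignoring NULLs.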
