On Thu, Jan 29, 2026 at 5:27 PM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
> > On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <[email protected]>
> > > Not yet, I first wanted to hear if I'm not sailing into some plain stupid
> > > direction somewhere with this idea or implementation (e.g.
> > > that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).
> > >
> > > I'll try to perform this test overhead measurement hopefully with v3 once
> > > we settle on how to do that bit shifting/clz().
> > >
> >
> > [..]
> > Here's the answer: on properly isolated perf test run (my
> > old&legacy&predictiable
> > 4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no turboboost, proper
> > warmup,
> > no THP, cpupower D0, no physical I/O, ~22k pread64() calls/sec combined to
> > VFS
> > cache)
> > and started on just using single NUMA: numactl --membind=0
> > --cpunodebind=0
> > measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S
> >
> > master+track_io_timings=on, 60s warmup and then 3x runs
> > tps = 44615.603668
> > tps = 44556.191492
> > tps = 44813.793981
> > avg = 44662
> >
> > master+track_io_timings=on+patch, , 60s warmup and then 3x runs
> > tps = 44441.879384
> > tps = 44403.101737
> > tps = 45036.747418
> > avg = 44627
> >
> > so that's like 99.921% (so literally no overhead) and yields picture like:
>
> I don't think that's a particularly useful assurance, unfortunately:
>
> 1) Using pgbench with an in-memory readonly workload is typically limited by
> context switch overhead and per-statement overhead. After a short while you
> have at most one IO per statement (the heap page), which obviously isn't
> going to be affected by a small per-IO overhead.
>
> 2) The per-core memory bandwidth on that old machine, if it's the quite old
> EDB machine I think it is, is so low, that you'd be bottlenecked by memory
> bandwidth well before you're going to be bottlenecked by actual CPU stuff
> (which the bucket computation is).
Hi, thanks for having a look!
That legacy server is mine, but yeah, even on the same NUMA node it can only get
~16GB/s AFAIR, and just ~4GB/s between nodes. I forgot to reply in that old NUMA
thread back then; maybe it's not relevant, but I often find that box valuable as
the bottlenecks are easier to hit/notice and it doesn't have many of the traps
that modern CPUs have (plus finding 4 NUMA nodes/sockets in 2U is not that easy
today; it's also a physical box nearby, so PMCs are available too - unlike in
the "cloud" - and with 4 nodes the imbalances between nodes/zones are much more
clearly visible than with just 1 local vs 1 remote). Somehow I've built trust in
the results from that machine (but I can still lift it, so probably I shouldn't
trust it fully - pun on "never trust a machine you can lift" :D).
> I think you'd have to test something like pg_prewarm(), with
> io_combine_limit=1, on a modern *client* CPU (client CPUs typically have much
> higher per-core memory bandwidth than the more scalable server CPUs).
Fair point, thanks for explaining the above. So I've switched to a much more
modern Intel(R) Ultra 7 (1s16c32t) that - according to mlc(1) - can do up to
85-90GB/s of bandwidth and has those ugly P/E-cores, so I've pinned (taskset)
the backend doing pg_prewarm to a P-core that usually runs @ 5GHz, but with
no_turbo that's just @ 1.4GHz. That was on a normal build (not debug, so -O2,
no casserts, gcc 13, kernel 6.14.x, no_turbo, s_b=8GB with HP, DB scale 600
(pgbench_accounts @ 7.7GB), performance governor, no THP, ...).
I was kind of surprised, but here goes the full disclosure of my results. With
the patch and track_io_timing=on, io_combine_limit=1, io_method='sync' (just so
there's only 1 PID to care about), I ran pg_buffercache_evict_all() and then
measured the duration of:
select pg_prewarm('pgbench_accounts');
So initially I got this picture (some worst/best results eliminated):
Without patch:
Time: 4644.346 ms (00:04.644)
Time: 4612.610 ms (00:04.613)
Time: 4639.133 ms (00:04.639)
Time: 4625.020 ms (00:04.625)
Time: 4636.652 ms (00:04.637)
Avg: 4631ms
With the patch:
Time: 4765.780 ms (00:04.766)
Time: 4784.308 ms (00:04.784)
Time: 4754.661 ms (00:04.755)
Time: 4770.772 ms (00:04.771)
Time: 4768.232 ms (00:04.768)
Avg: 4768ms (102.95%)
With the patch and __builtin_clzl()
Time: 4750.293 ms (00:04.750)
Time: 4729.288 ms (00:04.729)
Time: 4727.820 ms (00:04.728)
Time: 4729.760 ms (00:04.730)
Time: 4727.146 ms (00:04.727)
Avg: 4732ms (102.18%)
So clearly there was some overhead (I started getting worried), and the
__builtin_clzl() variant was slightly cheaper - or it was just jitter; yes, I
got plenty of jitter with this (not shown, so the above are like 5 results out
of 15).
With the v2 patch, __builtin_clzl() and track_io_timing=off (the default), it
got back to ~4660ms as expected.
With the v2 patch, __builtin_clzl(), the default io_combine_limit=128kB and
track_io_timing=off, it went back to ~4150ms:
Time: 4151.942 ms (00:04.152)
Time: 4133.747 ms (00:04.134)
Time: 4153.103 ms (00:04.153)
Time: 4135.199 ms (00:04.135)
With the patch, __builtin_clzl(), the default io_combine_limit=128kB and
track_io_timing=on, it was also @ ~4150ms:
Time: 4152.941 ms (00:04.153)
Time: 4154.096 ms (00:04.154)
Time: 4155.119 ms (00:04.155)
So with "batching" the IOs, the overhead is almost gone. BTW that's with
current_clocksource says "tsc". After dozens of runs, I've noticed thermals
starting playing a bigger role than this patch, so i've did idle-set -D0
and it degraded even further.
Master, but still with lots of fluctuations, unfiltered picture:
Time: 5518.546 ms (00:05.519)
Time: 5587.675 ms (00:05.588)
Time: 5512.828 ms (00:05.513)
Time: 5534.023 ms (00:05.534)
Time: 5728.125 ms (00:05.728)
Time: 5731.543 ms (00:05.732)
Time: 5762.687 ms (00:05.763)
Time: 5565.607 ms (00:05.566)
Time: 5498.496 ms (00:05.498)
Time: 5637.870 ms (00:05.638)
but if I leave it *idle* for a couple of minutes then I get:
Time: 5577.879 ms (00:05.578)
Time: 5575.648 ms (00:05.576)
Time: 5548.146 ms (00:05.548)
After some break, with the patch and __builtin_clzl (it sometimes gets lower
than master - how can I trust this?!):
Time: 5504.415 ms (00:05.504)
Time: 5531.827 ms (00:05.532)
Time: 5733.146 ms (00:05.733)
Time: 5511.549 ms (00:05.512)
So something more is happening there, probably with thermals/scheduler rather
than with the patch. Of course I've done some homework [1][2], I even found
your rant on some of this [1], and truth be told I'm unable to stabilize those
deviations on this __modern__ client CPU. So I tried on another, much more
predictable (and non-modern :P) client CPU: Intel Core i5 7600K (1s4c4t) and
got much more consistent numbers there (those are unfiltered, with an almost
identical setup: also 6.14.x, same tweaks, also with taskset to 1 core):
Master:
Time: 2592.351 ms (00:02.592)
Time: 2574.612 ms (00:02.575)
Time: 2592.530 ms (00:02.593)
Time: 2575.356 ms (00:02.575)
Time: 2594.687 ms (00:02.595)
Avg=2585ms
Master+patch:
Time: 2577.610 ms (00:02.578)
Time: 2585.796 ms (00:02.586)
Time: 2568.559 ms (00:02.569)
Time: 2586.199 ms (00:02.586)
Time: 2567.872 ms (00:02.568)
Avg=2576ms (below master?!)
Master+patch__builtin_clzl:
Time: 2578.083 ms (00:02.578)
Time: 2586.732 ms (00:02.587)
Time: 2573.176 ms (00:02.573)
Time: 2592.048 ms (00:02.592)
Time: 2575.731 ms (00:02.576)
Time: 2575.570 ms (00:02.576)
Avg=2579ms (below master?!)
Just Master again:
Time: 2578.838 ms (00:02.579)
Time: 2588.531 ms (00:02.589)
Time: 2572.165 ms (00:02.572)
Time: 2591.528 ms (00:02.592)
Time: 2572.015 ms (00:02.572)
Time: 2589.921 ms (00:02.590)
Time: 2572.124 ms (00:02.572)
Avg=2580ms
So to sum up:
- it still looks OK to me
- thermals on modern-day CPUs can have a bigger impact than the patch itself (?)
- the older/legacy (desktop) CPU seems to jitter less than the modern client
  laptop CPU, even with the strictest perf settings (?)
- worst case: to spot that ~2% regression one would have to disable the IO
  batching and enable track_io_timing (which is not the default)
I'm attaching v3, which now defaults to __builtin_clzl(); that works OK for
uint64 here (not sure if I need to care about __builtin_clzll on Windows, where
long is 32-bit?).
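For completeness, a tiny standalone sketch (not part of the patch; it calls
__builtin_clzll() directly, so it assumes GCC/Clang and also works where long is
32-bit) to double-check which bucket a given nanosecond latency lands in under
the v3 scheme:

/* bucket_check.c - standalone sanity check, not part of the patch */
#include <stdint.h>
#include <stdio.h>

#define PGSTAT_IO_HIST_BUCKETS 16
#define MIN_PG_STAT_IO_HIST_LATENCY 8191	/* 2^13 - 1 ns, so bucket 0 ends at ~8 us */

static int
get_bucket_index(uint64_t ns)
{
	const int	max_index = PGSTAT_IO_HIST_BUCKETS - 1;
	/* leading-zero count of 8191 (thirteen 1-bits) on 64 bits is 51 */
	const int	min_latency_leading_zeros = __builtin_clzll(MIN_PG_STAT_IO_HIST_LATENCY);
	/* OR in the minimum so the clz argument is never 0 (clz of 0 is undefined) */
	uint64_t	tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY;
	int			index = min_latency_leading_zeros - __builtin_clzll(tmp);

	return (index > max_index) ? max_index : index;
}

int
main(void)
{
	/* a few representative latencies in nanoseconds */
	uint64_t	samples[] = {0, 5000, 8192, 20000, 1000000, 50000000, UINT64_MAX};

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("%20llu ns -> bucket %d\n",
			   (unsigned long long) samples[i], get_bucket_index(samples[i]));
	return 0;
}

It should print bucket 0 for anything below 8192 ns, bucket 1 for
[8192, 16384) ns, and so on, with everything from roughly 134 ms (2^27 ns)
upwards clamped into the last bucket (15).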
Open questions:
0. Should I pursue more benchmarking, or are the above results enough?
1. __builtin_clzl() or not to __builtin_clzl(), that is the question...?
2. Should I add per-PID backend stats too, or do you have something against it?
3. Shouldn't we fix that mdsyncfiletag() issue mentioned earlier, where we seem
   to have pgstat_count_io_op_time() *after* a potential FileClose() (as per my
   earlier question)?
-J.
[1] -
https://www.postgresql.org/message-id/20231115180433.p3eeaczbam5zxdz5%40awork3.anarazel.de
[2] - https://vondra.me/posts/benchmarking-is-hard-sometimes/
From 419b8698108bc3776b712d9e51823ef60315d434 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 23 Jan 2026 08:10:09 +0100
Subject: [PATCH v3] Add pg_stat_io_histogram view to provide more detailed
insight into IO profile
pg_stat_io_histogram displays a histogram of I/O latencies for each combination
of backend_type, object, context and io_type. The histogram buckets allow
faster identification of I/O latency outliers caused by faulty hardware and/or
a misbehaving I/O stack. Such outliers, e.g. slow fsyncs, can cause
intermittent issues such as slow COMMITs or degraded performance of
synchronous standbys.
Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by:
Discussion: https://postgr.es/m/CAKZiRmwvE4uJLKTgPXeBA4m%2Bd4tTghayoefcaM9%3Dz3_S7i72GA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 12 +-
doc/src/sgml/monitoring.sgml | 270 ++++++++++++++++++++++++-
doc/src/sgml/wal.sgml | 5 +-
src/backend/catalog/system_views.sql | 11 +
src/backend/utils/activity/pgstat_io.c | 97 +++++++++
src/backend/utils/adt/pgstatfuncs.c | 146 +++++++++++++
src/include/catalog/pg_proc.dat | 9 +
src/include/pgstat.h | 14 ++
src/include/port/pg_bitutils.h | 28 +++
src/test/regress/expected/rules.out | 8 +
src/test/regress/expected/stats.out | 23 +++
src/test/regress/sql/stats.sql | 15 ++
12 files changed, 631 insertions(+), 7 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5560b95ee60..614af387ee5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8729,9 +8729,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
displayed in <link linkend="monitoring-pg-stat-database-view">
<structname>pg_stat_database</structname></link>,
<link linkend="monitoring-pg-stat-io-view">
- <structname>pg_stat_io</structname></link> (if <varname>object</varname>
- is not <literal>wal</literal>), in the output of the
- <link linkend="pg-stat-get-backend-io">
+ <structname>pg_stat_io</structname></link> and
+ <link linkend="monitoring-pg-stat-io-histogram-view">
+ <structname>pg_stat_io_histogram</structname></link>
+ (if <varname>object</varname> is not <literal>wal</literal>),
+ in the output of the <link linkend="pg-stat-get-backend-io">
<function>pg_stat_get_backend_io()</function></link> function (if
<varname>object</varname> is not <literal>wal</literal>), in the
output of <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal>
@@ -8761,7 +8763,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
measure the overhead of timing on your system.
I/O timing information is displayed in
<link linkend="monitoring-pg-stat-io-view">
- <structname>pg_stat_io</structname></link> for the
+ <structname>pg_stat_io</structname></link> and
+ <link linkend="monitoring-pg-stat-io-histogram-view">
+ <structname>pg_stat_io_histogram</structname></link> for the
<varname>object</varname> <literal>wal</literal> and in the output of
the <link linkend="pg-stat-get-backend-io">
<function>pg_stat_get_backend_io()</function></link> function for the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 88450facebd..f8ce9f91ba2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -493,6 +493,17 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
</entry>
</row>
+ <row>
+ <entry><structname>pg_stat_io_histogram</structname><indexterm><primary>pg_stat_io_histogram</primary></indexterm></entry>
+ <entry>
+      One row for each combination of backend type, context, target object,
+      I/O operation type and latency bucket (in microseconds), containing
+      cluster-wide I/O statistics.
+ See <link linkend="monitoring-pg-stat-io-histogram-view">
+ <structname>pg_stat_io_histogram</structname></link> for details.
+ </entry>
+ </row>
+
<row>
<entry><structname>pg_stat_replication_slots</structname><indexterm><primary>pg_stat_replication_slots</primary></indexterm></entry>
<entry>One row per replication slot, showing statistics about the
@@ -690,7 +701,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<para>
The <structname>pg_stat_io</structname> and
- <structname>pg_statio_</structname> set of views are useful for determining
+   <structname>pg_stat_io_histogram</structname> set of views are useful for determining
the effectiveness of the buffer cache. They can be used to calculate a cache
hit ratio. Note that while <productname>PostgreSQL</productname>'s I/O
statistics capture most instances in which the kernel was invoked in order
@@ -699,6 +710,8 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
Users are advised to use the <productname>PostgreSQL</productname>
statistics views in combination with operating system utilities for a more
complete picture of their database's I/O performance.
+    Furthermore, the <structname>pg_stat_io_histogram</structname> view can be
+    helpful in identifying latency outliers for specific I/O operations.
</para>
</sect2>
@@ -3118,6 +3131,261 @@ description | Waiting for a newly initialized WAL file to reach durable storage
</sect2>
+ <sect2 id="monitoring-pg-stat-io-histogram-view">
+   <title><structname>pg_stat_io_histogram</structname></title>
+
+ <indexterm>
+ <primary>pg_stat_io_histogram</primary>
+ </indexterm>
+
+ <para>
+   The <structname>pg_stat_io_histogram</structname> view contains one row of
+   cluster-wide I/O statistics for each combination of backend type, target I/O
+   object, I/O context, I/O operation type and latency bucket. Combinations
+   which do not make sense are omitted.
+ </para>
+
+ <para>
+   The view shows the I/O latency as perceived by the backend, not by the kernel
+   or the device. This is an important distinction when troubleshooting, as the
+   latency observed by the backend can be affected by:
+   <itemizedlist>
+    <listitem><para>OS scheduler decisions and available CPU resources.</para></listitem>
+    <listitem><para>With AIO, time spent servicing other I/Os from the queue,
+     which will often inflate the observed latency.</para></listitem>
+    <listitem><para>In the case of writes, additional filesystem journaling
+     operations.</para></listitem>
+   </itemizedlist>
+ </para>
+
+ <para>
+ Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+ tracked. However, relation I/O which bypasses shared buffers
+ (e.g. when moving a table from one tablespace to another) is currently
+ not tracked.
+ </para>
+
+  <table id="pg-stat-io-histogram-view" xreflabel="pg_stat_io_histogram">
+ <title><structname>pg_stat_io_histogram</structname> View</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ Column Type
+ </para>
+ <para>
+ Description
+ </para>
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>backend_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of backend (e.g. background worker, autovacuum worker). See <link
+ linkend="monitoring-pg-stat-activity-view">
+ <structname>pg_stat_activity</structname></link> for more information
+ on <varname>backend_type</varname>s. Some
+ <varname>backend_type</varname>s do not accumulate I/O operation
+ statistics and will not be included in the view.
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>object</structfield> <type>text</type>
+ </para>
+ <para>
+ Target object of an I/O operation. Possible values are:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>relation</literal>: Permanent relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>temp relation</literal>: Temporary relations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>wal</literal>: Write Ahead Logs.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>context</structfield> <type>text</type>
+ </para>
+ <para>
+ The context of an I/O operation. Possible values are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>normal</literal>: The default or standard
+ <varname>context</varname> for a type of I/O operation. For
+ example, by default, relation data is read into and written out from
+ shared buffers. Thus, reads and writes of relation data to and from
+ shared buffers are tracked in <varname>context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>init</literal>: I/O operations performed while creating the
+ WAL segments are tracked in <varname>context</varname>
+ <literal>init</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>vacuum</literal>: I/O operations performed outside of shared
+ buffers while vacuuming and analyzing permanent relations. Temporary
+ table vacuums use the same local buffer pool as other temporary table
+ I/O operations and are tracked in <varname>context</varname>
+ <literal>normal</literal>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkread</literal>: Certain large read I/O operations
+ done outside of shared buffers, for example, a sequential scan of a
+ large table.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>bulkwrite</literal>: Certain large write I/O operations
+ done outside of shared buffers, such as <command>COPY</command>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>io_type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of IO operation. Possible values are: FIXME!
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>bucket_latency_us</structfield> <type>int4range</type>
+ </para>
+ <para>
+ The latency bucket (in microseconds).
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>bucket_count</structfield> <type>bigint</type>
+ </para>
+ <para>
+       Number of times the latency of the I/O operation fell into this specific
+       bucket (i.e. within the <varname>bucket_latency_us</varname> range).
+ </para>
+ </entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry">
+ <para role="column_definition">
+ <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+ </para>
+ <para>
+ Time at which these statistics were last reset.
+ </para>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <para>
+ Some backend types never perform I/O operations on some I/O objects and/or
+ in some I/O contexts. These rows are omitted from the view. For example, the
+ checkpointer does not checkpoint temporary tables, so there will be no rows
+ for <varname>backend_type</varname> <literal>checkpointer</literal> and
+ <varname>object</varname> <literal>temp relation</literal>.
+ </para>
+
+ <para>
+   In addition, some I/O operations will never be performed either by certain
+   backend types or on certain I/O objects and/or in certain I/O contexts.
+   For example, temporary tables are not <literal>fsync</literal>ed, so rows
+   with <varname>io_type</varname> <literal>fsync</literal> and
+   <varname>object</varname> <literal>temp relation</literal> will show no
+   activity. Also, the background writer does not perform reads, so rows with
+   <varname>io_type</varname> <literal>read</literal> will show no activity for
+   <varname>backend_type</varname> <literal>background writer</literal>.
+ </para>
+
+ <para>
+   For the <varname>object</varname> <literal>wal</literal>, rows with
+   <varname>io_type</varname> <literal>fsync</literal> track the
+   fsync activity of WAL files done in <function>issue_xlog_fsync</function>,
+   and rows with <varname>io_type</varname> <literal>write</literal>
+   track the write activity of WAL files done in
+   <function>XLogWrite</function>.
+   See <xref linkend="wal-configuration"/> for more information.
+ </para>
+
+ <para>
+   <structname>pg_stat_io_histogram</structname> can be used to identify
+   I/O storage issues.
+   For example:
+ <itemizedlist>
+ <listitem>
+ <para>
+      Abnormally high latency for <literal>fsync</literal> operations might
+      indicate I/O saturation, oversubscription or hardware connectivity issues.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+      Unusually high latency for <literal>fsync</literal> operations by the
+      standby's startup backend type might be responsible for long commit
+      durations in synchronous replication setups.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <note>
+ <para>
+    Histogram bucket counts are only accumulated when
+    <xref linkend="guc-track-io-timing"/> is enabled. The user should be
+    careful when interpreting <varname>bucket_count</varname> in case
+    <varname>track_io_timing</varname> was not enabled for the entire time
+    since the last stats reset.
+ </para>
+ </note>
+ </sect2>
+
<sect2 id="monitoring-pg-stat-bgwriter-view">
<title><structname>pg_stat_bgwriter</structname></title>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index f3b86b26be9..8b8c407e69f 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -832,8 +832,9 @@
of times <function>XLogWrite</function> writes and
<function>issue_xlog_fsync</function> syncs WAL data to disk are also
counted as <varname>writes</varname> and <varname>fsyncs</varname>
- in <structname>pg_stat_io</structname> for the <varname>object</varname>
- <literal>wal</literal>, respectively.
+ in <structname>pg_stat_io</structname> and
+ <structname>pg_stat_io_histogram</structname> for the
+ <varname>object</varname> <literal>wal</literal>, respectively.
</para>
<para>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7553f31fef0..c44c4167646 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1219,6 +1219,17 @@ SELECT
b.stats_reset
FROM pg_stat_get_io() b;
+CREATE VIEW pg_stat_io_histogram AS
+SELECT
+ b.backend_type,
+ b.object,
+ b.context,
+ b.io_type,
+ b.bucket_latency_us,
+ b.bucket_count,
+ b.stats_reset
+FROM pg_stat_get_io_histogram() b;
+
CREATE VIEW pg_stat_wal AS
SELECT
w.wal_records,
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..ebaa16ce239 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -107,6 +107,66 @@ pgstat_prepare_io_time(bool track_io_guc)
return io_start;
}
+#if 1
+static inline int get_bucket_index(uint64_t ns) {
+#define MIN_PG_STAT_IO_HIST_LATENCY 8191
+ const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+	/*
+	 * Hopefully pre-calculated by the compiler: the leading-zero count of
+	 * 8191 (thirteen 1-bits) on a uint64 is 51.
+	 */
+ const uint32_t min_latency_leading_zeros =
+ pg_leading_zero_bits64(MIN_PG_STAT_IO_HIST_LATENCY);
+
+	/*
+	 * Make sure tmp is at least 8191 (our minimum bucket boundary), since
+	 * counting leading zeros of 0 is undefined behavior.
+	 */
+ uint64_t tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY;
+
+ /* count leading zeros */
+ int leading_zeros = pg_leading_zero_bits64(tmp);
+
+ /* normalize the index */
+ uint32_t index = min_latency_leading_zeros - leading_zeros;
+
+ /* clamp it to the maximum */
+ return (index > max_index) ? max_index : index;
+}
+#else
+/*
+ * Calculate bucket index based on value in microseconds. We have up to
+ * PGSTAT_IO_HIST_BUCKETS buckets.
+ * If < 8 (2^3) -> bucket 0
+ * If < 16 (2^4) -> bucket 1
+ * ..
+ */
+static inline int
+get_bucket_index(uint64_t ns) {
+ uint8_t bucket = 0, result;
+ uint64_t temp = ns;
+
+ /* Find Most Significant Bit (Log2) using manual shifts */
+ if (temp >= 65536) { temp >>= 16; bucket += 16; }
+ if (temp >= 256) { temp >>= 8; bucket += 8; }
+ if (temp >= 16) { temp >>= 4; bucket += 4; }
+ if (temp >= 4) { temp >>= 2; bucket += 2; }
+ if (temp >= 2) { temp >>= 1; bucket += 1; }
+
+#define MIN_BUCKET_POW 12
+ /* Clamp to our baseline of 8192ns = 2^13, so everything below should be
+ * put into bucket 0.
+ */
+ if (bucket <= MIN_BUCKET_POW)
+ return 0;
+ else
+ result = bucket - MIN_BUCKET_POW;
+
+ /* Finally ensure we are within proper 0-15 range */
+ return (result > PGSTAT_IO_HIST_BUCKETS-1) ? PGSTAT_IO_HIST_BUCKETS-1 : result;
+}
+#endif
+
/*
* Like pgstat_count_io_op() except it also accumulates time.
*
@@ -125,6 +185,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
if (!INSTR_TIME_IS_ZERO(start_time))
{
instr_time io_time;
+ int bucket_index;
INSTR_TIME_SET_CURRENT(io_time);
INSTR_TIME_SUBTRACT(io_time, start_time);
@@ -152,6 +213,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
io_time);
+ /* calculate the bucket_index based on latency in nanoseconds (uint64) */
+ bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+ PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+
/* Add the per-backend count */
pgstat_count_backend_io_op_time(io_object, io_context, io_op,
io_time);
@@ -221,6 +286,10 @@ pgstat_io_flush_cb(bool nowait)
bktype_shstats->times[io_object][io_context][io_op] +=
INSTR_TIME_GET_MICROSEC(time);
+
+ for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+ bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+ PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
}
}
}
@@ -274,6 +343,34 @@ pgstat_get_io_object_name(IOObject io_object)
pg_unreachable();
}
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+ switch (io_op)
+ {
+ case IOOP_EVICT:
+ return "evict";
+ case IOOP_FSYNC:
+ return "fsync";
+ case IOOP_HIT:
+ /* TODO: investigate this! */
+ return "hit?";
+ case IOOP_REUSE:
+ return "reuse";
+ case IOOP_WRITEBACK:
+ return "writeback";
+ case IOOP_EXTEND:
+ return "extend";
+ case IOOP_READ:
+ return "read";
+ case IOOP_WRITE:
+ return "write";
+ }
+
+ elog(ERROR, "unrecognized IOOp value: %d", io_op);
+ pg_unreachable();
+}
+
void
pgstat_io_init_shmem_cb(void *stats)
{
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 73ca0bb0b7f..78efb78f5fb 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -18,9 +18,10 @@
#include "access/xlog.h"
#include "access/xlogprefetcher.h"
#include "catalog/catalog.h"
+#include "catalog/namespace.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "common/ip.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -30,6 +32,7 @@
#include "storage/procarray.h"
#include "utils/acl.h"
#include "utils/builtins.h"
+#include "utils/rangetypes.h"
#include "utils/timestamp.h"
#define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var))))
@@ -1627,6 +1630,149 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+/*
+ * When adding a new column to the pg_stat_io_histogram view and the
+ * pg_stat_get_io_histogram() function, add a new enum value here above
+ * HIST_IO_NUM_COLUMNS.
+ */
+typedef enum hist_io_stat_col
+{
+ HIST_IO_COL_INVALID = -1,
+ HIST_IO_COL_BACKEND_TYPE,
+ HIST_IO_COL_OBJECT,
+ HIST_IO_COL_CONTEXT,
+ HIST_IO_COL_IOTYPE,
+ HIST_IO_COL_BUCKET_US,
+ HIST_IO_COL_COUNT,
+ HIST_IO_COL_RESET_TIME,
+ HIST_IO_NUM_COLUMNS
+} histogram_io_stat_col;
+
+/*
+ * pg_stat_io_histogram_build_tuples
+ *
+ * Helper routine for pg_stat_get_io_histogram(), filling a result tuplestore
+ * with one tuple for each object, context, I/O operation type and latency
+ * bucket supported by the caller, based on the contents of bktype_stats.
+ */
+static void
+pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+ PgStat_BktypeIO *bktype_stats,
+ BackendType bktype,
+ TimestampTz stat_reset_timestamp)
+{
+ /* Get OID for int4range type */
+ Datum bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+ Oid range_typid = TypenameGetTypid("int4range");
+ TypeCacheEntry *typcache = lookup_type_cache(range_typid, TYPECACHE_RANGE_INFO);
+
+ for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+ {
+ const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+ for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+ {
+ const char *context_name = pgstat_get_io_context_name(io_context);
+
+ /*
+ * Some combinations of BackendType, IOObject, and IOContext are
+ * not valid for any type of IOOp. In such cases, omit the entire
+ * row from the view.
+ */
+ if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+ continue;
+
+ for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+ {
+ const char *op_name = pgstat_get_io_op_name(io_op);
+
+ for(int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++) {
+ Datum values[HIST_IO_NUM_COLUMNS] = {0};
+ bool nulls[HIST_IO_NUM_COLUMNS] = {0};
+ RangeBound lower, upper;
+ RangeType *range;
+
+ values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
+ values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
+ values[HIST_IO_COL_CONTEXT] = CStringGetTextDatum(context_name);
+ values[HIST_IO_COL_IOTYPE] = CStringGetTextDatum(op_name);
+
+				/* bucket N (N > 0) spans [2^(2+N), 2^(3+N)) microseconds; bucket 0 starts at 0 */
+ if(bucket == 0)
+ lower.val = Int32GetDatum(0);
+ else
+ lower.val = Int32GetDatum(1 << (2 + bucket));
+ lower.infinite = false;
+ lower.inclusive = true;
+ lower.lower = true;
+
+ if(bucket == PGSTAT_IO_HIST_BUCKETS - 1)
+ upper.infinite = true;
+ else {
+ upper.val = Int32GetDatum(1 << (2 + bucket + 1));
+ upper.infinite = false;
+ }
+				upper.inclusive = false;
+ upper.lower = false;
+
+ range = make_range(typcache, &lower, &upper, false, NULL);
+ values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
+
+ /* bucket count */
+ values[HIST_IO_COL_COUNT] = Int64GetDatum(
+ bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+
+ if (stat_reset_timestamp != 0)
+ values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
+ else
+ nulls[HIST_IO_COL_RESET_TIME] = true;
+
+ tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+ values, nulls);
+ }
+ }
+ }
+ }
+}
+
+Datum
+pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsinfo;
+ PgStat_IO *backends_io_stats;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ backends_io_stats = pgstat_fetch_stat_io();
+
+ for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+ {
+ PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+ /*
+ * In Assert builds, we can afford an extra loop through all of the
+ * counters (in pg_stat_io_build_tuples()), checking that only
+ * expected stats are non-zero, since it keeps the non-Assert code
+ * cleaner.
+ */
+ Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+ /*
+ * For those BackendTypes without IO Operation stats, skip
+ * representing them in the view altogether.
+ */
+ if (!pgstat_tracks_io_bktype(bktype))
+ continue;
+
+ /* save tuples with data from this PgStat_BktypeIO */
+ pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+ backends_io_stats->stat_reset_timestamp);
+ }
+
+ return (Datum) 0;
+}
+
/*
* pg_stat_wal_build_tuple
*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 894b6a1b6d6..c2ff31049cf 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6026,6 +6026,15 @@
proargnames => '{backend_type,object,context,reads,read_bytes,read_time,writes,write_bytes,write_time,writebacks,writeback_time,extends,extend_bytes,extend_time,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
prosrc => 'pg_stat_get_io' },
+{ oid => '6149', descr => 'statistics: per backend type IO latency histogram',
+ proname => 'pg_stat_get_io_histogram', prorows => '30', proretset => 't',
+ provolatile => 'v', proparallel => 'r', prorettype => 'record',
+ proargtypes => '',
+ proallargtypes => '{text,text,text,text,int4range,int8,timestamptz}',
+ proargmodes => '{o,o,o,o,o,o,o}',
+ proargnames => '{backend_type,object,context,io_type,bucket_latency_us,bucket_count,stats_reset}',
+ prosrc => 'pg_stat_get_io_histogram' },
+
{ oid => '6386', descr => 'statistics: backend IO statistics',
proname => 'pg_stat_get_backend_io', prorows => '5', proretset => 't',
provolatile => 'v', proparallel => 'r', prorettype => 'record',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff7ecc2533..90e5b08acb9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -322,11 +322,23 @@ typedef enum IOOp
(((unsigned int) (io_op)) < IOOP_NUM_TYPES && \
((unsigned int) (io_op)) >= IOOP_EXTEND)
+/*
+ * The number of buckets is a balance between being fast and providing value
+ * to the users:
+ * 1. We want to cover various fast and slow device types (0.01ms - 15ms).
+ * 2. We also want to cover sporadic long-tail latencies (hardware issues,
+ *    delayed fsyncs, stuck I/O).
+ * 3. We want to keep the per-cell counter array small:
+ *    16 * sizeof(uint64) = 128 bytes, i.e. two cachelines.
+ */
+#define PGSTAT_IO_HIST_BUCKETS 16
+
typedef struct PgStat_BktypeIO
{
uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+ uint64 hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
} PgStat_BktypeIO;
typedef struct PgStat_PendingIO
@@ -334,6 +346,7 @@ typedef struct PgStat_PendingIO
uint64 bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
instr_time pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+ uint64 pending_hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
} PgStat_PendingIO;
typedef struct PgStat_IO
@@ -606,6 +619,7 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
extern PgStat_IO *pgstat_fetch_stat_io(void);
extern const char *pgstat_get_io_context_name(IOContext io_context);
extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
extern bool pgstat_tracks_io_bktype(BackendType bktype);
extern bool pgstat_tracks_io_object(BackendType bktype,
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 35761f509ec..b963d4dde35 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,36 @@
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+
+/*
+ * pg_leading_zero_bits64
+ *		Returns the number of leading 0-bits in "word", starting at the most
+ *		significant bit position.  Word must not be 0.
+ */
+static inline int
+pg_leading_zero_bits64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZ
+	Assert(word != 0);
+
+	return __builtin_clzl(word);
+#else
+	int			n = 64;
+	uint64		y;
+
+	if (word == 0)
+		return 64;
+
+	y = word >> 32; if (y != 0) { n -= 32; word = y; }
+	y = word >> 16; if (y != 0) { n -= 16; word = y; }
+	y = word >> 8; if (y != 0) { n -= 8; word = y; }
+	y = word >> 4; if (y != 0) { n -= 4; word = y; }
+	y = word >> 2; if (y != 0) { n -= 2; word = y; }
+	y = word >> 1; if (y != 0) return n - 2;
+	return n - (int) word;
+#endif
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f4ee2bd7459..88dc703bbde 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1951,6 +1951,14 @@ pg_stat_io| SELECT backend_type,
fsync_time,
stats_reset
FROM pg_stat_get_io() b(backend_type, object, context, reads, read_bytes, read_time, writes, write_bytes, write_time, writebacks, writeback_time, extends, extend_bytes, extend_time, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+pg_stat_io_histogram| SELECT backend_type,
+ object,
+ context,
+ io_type,
+ bucket_latency_us,
+ bucket_count,
+ stats_reset
+ FROM pg_stat_get_io_histogram() b(backend_type, object, context, io_type, bucket_latency_us, bucket_count, stats_reset);
pg_stat_progress_analyze| SELECT s.pid,
s.datid,
d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index cd00f35bf7a..4c95f09d651 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1765,6 +1765,29 @@ SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
t
(1 row)
+-- Check that pg_stat_io_histogram sees some growing counts in buckets
+-- We could also try with checkpointer, but it often runs with fsync=off
+-- during test.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+ pg_stat_force_next_flush
+--------------------------
+
+(1 row)
+
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+ ?column?
+----------
+ t
+(1 row)
+
+RESET track_io_timing;
-- Check invalid input for pg_stat_get_backend_io()
SELECT pg_stat_get_backend_io(NULL);
pg_stat_get_backend_io
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 8768e0f27fd..063b1011d7e 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -841,6 +841,21 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) +
FROM pg_stat_get_backend_io(pg_backend_pid()) \gset
SELECT :my_io_stats_pre_reset > :my_io_stats_post_backend_reset;
+
+-- Check that pg_stat_io_histogram sees some growing counts in buckets
+-- We could also try with checkpointer, but it often runs with fsync=off
+-- during test.
+SET track_io_timing TO 'on';
+SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+CREATE TABLE test_io_hist(id bigint);
+INSERT INTO test_io_hist SELECT generate_series(1, 100) s;
+SELECT pg_stat_force_next_flush();
+SELECT sum(bucket_count) AS hist_bucket_count_sum2 FROM pg_stat_get_io_histogram()
+WHERE backend_type='client backend' AND object='relation' AND context='normal' \gset
+SELECT :hist_bucket_count_sum2 > :hist_bucket_count_sum;
+RESET track_io_timing;
+
-- Check invalid input for pg_stat_get_backend_io()
SELECT pg_stat_get_backend_io(NULL);
SELECT pg_stat_get_backend_io(0);
--
2.43.0