On Sat, Mar 28, 2026 at 9:12 AM Ashutosh Bapat
<[email protected]> wrote:
>
> On Sat, Mar 28, 2026 at 11:07 AM Masahiko Sawada <[email protected]>
> wrote:
> >
> > On Fri, Mar 27, 2026 at 3:58 PM Tomas Vondra <[email protected]> wrote:
> > >
> > > On 3/26/26 05:21, Ashutosh Bapat wrote:
> > >
> > > The main argument here seems to be the performance, and the initial
> > > message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
> > > 128MB shared buffers. Unless I misunderstood what config it uses.
> > >
> > > I gave it a try on an azure VM with 32GB shared buffers, to make it a
> > > bit more realistic, and my timings are 10ms vs. 700ms. But I also wonder
> > > if the original timings really were from a cluster with 128MB, because
> > > for me that shows 0.3ms vs. 3ms (so an order of magnitude faster than
> > > what was reported). But I suppose that's also hw specific.
> > >
> > > Nevertheless, it is much faster. I haven't profiled this but I assume
> > > it's thanks to not having to write the entries into a tuplestore (and
> > > possibly into a tempfile).
> > >
> > > But is it actually needed / worth it? I wonder what timings does Lukas
> > > observe when running this on larger clusters. Because in a later email
> > > he says:
> > >
> > > ... we currently run this on a 10 minute schedule when enabled, and
> > > that seems to work in terms of understanding large swings in cache
> > > contents.
> > >
> > > I'm all in for optimizing stuff, but if you're running a monitoring task
> > > every 10 minutes, does it matter if it's running for 1 or 5 seconds? I
> > > find that a bit hard to believe.
> >
> > I imagined such a query is just one of many monitoring queries running
> > concurrently, so the cumulative overhead can still matter.
> >
>
> What kind of cumulative overhead, do you see? Reduced TPS, increased
> memory/CPU consumption? I think it would be good to see some metric
> evidence of this, rather than relying on assumptions.
On my part, the overhead I've specifically seen in the field, besides
CPU utilization (which isn't great, but could be worse), is temporary
file use with large shared_buffers, caused by writing one row per
buffer entry out to the tuplestore.
Here is an example from a production database, running Postgres 16
with 200GB shared_buffers:
SHOW shared_buffers;
shared_buffers
----------------
207873040kB
(1 row)
EXPLAIN (ANALYZE, BUFFERS) SELECT reldatabase, relfilenode, count(*)
FROM pg_buffercache
WHERE reldatabase IS NOT NULL
GROUP BY 1, 2;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=17.46..19.46 rows=200 width=16) (actual time=12999.311..12999.526 rows=1683 loops=1)
   Group Key: p.reldatabase, p.relfilenode
   Batches: 1  Memory Usage: 209kB
   Buffers: temp read=158595 written=158595
   I/O Timings: temp read=246.513 write=1013.565
   ->  Function Scan on pg_buffercache_pages p  (cost=0.00..10.00 rows=995 width=8) (actual time=5403.944..8864.129 rows=25984130 loops=1)
         Filter: (reldatabase IS NOT NULL)
         Buffers: temp read=158595 written=158595
         I/O Timings: temp read=246.513 write=1013.565
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.101 ms
 Execution Time: 13200.972 ms
(13 rows)
In this case we used a ~1.2GB temporary file to write out 25 million
rows, for what could have been a ~100kB in-memory allocation instead
(~40-byte BufferRelStatsEntry in v2 * 2048 simplehash slots).
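For what it's worth, the arithmetic behind those two figures can be
checked directly from the EXPLAIN output above (this sketch assumes the
default 8kB block size; the ~40-byte entry size and 2048 slots are the
v2 numbers quoted above, not something it derives):

```python
# Temp file side: temp read/written block counts from the EXPLAIN output,
# times the (assumed default) 8kB PostgreSQL block size.
BLOCK_SIZE = 8192
temp_blocks = 158595
temp_bytes = temp_blocks * BLOCK_SIZE
print(f"temp file: {temp_bytes / 1024**3:.2f} GiB")  # ~1.21 GiB

# In-memory side: approximate BufferRelStatsEntry size in the v2 patch,
# times the initial simplehash slot count.
entry_bytes = 40
slots = 2048
print(f"in-memory: {entry_bytes * slots / 1024:.0f} kB")  # 80 kB
```

So the tuplestore path writes out roughly four orders of magnitude more
data than the aggregated hash table ever holds.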
Thanks,
Lukas
--
Lukas Fittl