On 07/04/2026 16:47, Andres Freund wrote:
On 2026-04-07 16:07:45 +0300, Heikki Linnakangas wrote:
On 28/03/2026 06:18, Ashutosh Bapat wrote:
Parallely myself and Palak Chaturvedi developed a quick patch to
modernise pg_buffercache_pages() and use tuplestore so that it doesn't
have to rely on NBuffers being the same between start of the scan,
when memory allocated, when the scan ends - a condition possible with
resizing buffer cache. It seems to improve the timings by about 10-30%
on my laptop for 128MB buffercache size. Without this patch the time
taken to execute Lukas's query varies between 10-15ms on my laptop.
With this patch it varies between 8-9ms. So the timing is more stable
as a side effect. It's not a 10x improvement that we are looking for
but it looks like a step in the right direction. That improvement
seems to come purely because we avoid creating a heap tuple. I wonder
if there are some places up in the execution tree where full
heaptuples get formed again instead of continuing to use minimal
tuples or places where we perform some extra actions that are not
required.

I don't think that's the reason for the improvement - tuplestore_putvalues()
forms a minimal tuple, and the cost to form a minimal tuple and a heap tuple
aren't meaningfully different.

Yeah, I wasn't fully convinced of that part either, which is why I left it out of the commit message. I mostly wanted to get rid of the double-buffering where we first accumulated all the data in an array.

I think the problem is that we materialize rowmode SRFs as a tuplestore if
they are in the from list.  You can easily see this even with just
generate_series():

postgres[1520825][1]=# SELECT count(*) FROM generate_series(1, 1000000);
┌─────────┐
│  count  │
├─────────┤
│ 1000000 │
└─────────┘
(1 row)

Time: 117.939 ms
postgres[1520825][1]=# SELECT count(*) FROM (SELECT generate_series(1, 
1000000));
┌─────────┐
│  count  │
├─────────┤
│ 1000000 │
└─────────┘
(1 row)

Time: 58.914 ms

Oh, to be honest I didn't remember that we *don't* materialize when it's in the target list.

Of course, because pg_buffercache_pages() is archaicially defined without
defininig its output columns, you can't actually use it in the select list.

But that can be fixed:

CREATE FUNCTION pg_buffercache_pages_fast(OUT bufferid integer, OUT relfilenode 
oid, OUT reltablespace oid, OUT reldatabase oid,
          OUT relforknumber int2, OUT relblocknumber int8, OUT isdirty bool, 
OUT usagecount int2,
          OUT pinning_backends int4)
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE;

60GB of s_b, mostly filled, with 257c8231bf97a77378f6fedb826b1243f0a41612
reverted.

SELECT count(*) FROM (SELECT pg_buffercache_pages_fast());
Time: 1518.704 ms (00:01.519)

SELECT count(*) FROM pg_buffercache_pages_fast();
Time: 2008.101 ms (00:02.008)

Hmm, we could easily go back to ValuePerCall mode, while still getting rid of the temporary array and the other modernization. But "SELECT * FROM pg_buffercache_pages_fast()" doesn't actually produce the result we want:

postgres=# SELECT pg_buffercache_pages_fast() limit 1;
 pg_buffercache_pages_fast
---------------------------
 (1,1262,1664,0,0,0,f,5,0)
(1 row)

Can we turn that into the right shaep for the pg_buffercache view? This works:

postgres=# SELECT (pg_buffercache_pages_fast()).* limit 1;
bufferid | relfilenode | reltablespace | reldatabase | relforknumber | relblocknumber | isdirty | usagecount | pinning_backends
----------+-------------+---------------+-------------+---------------+----------------+---------+------------+------------------
1 | 1262 | 1664 | 0 | 0 | 0 | f | 5 | 0
(1 row)

But that doesn't seem to be faster anymore, despite avoiding the tuplestore, because of overheads elsewhere in the executor.

I didn't dig into the history to find out why we didn't modernize
pg_buffercache_pages(). I don't see any hazard though.

Committed this modernization patch, thanks!

It would be nice to have a proper row-at-a-time mode that would avoid
materializing the result, but collecting all the data in a temporary array
is clearly worse than just putting them to the tuplestore directly. The only
reason I can think of why we'd prefer to use a temporary array like that is
to get a more consistent snapshot of all the buffers, by keeping the time
spent scanning the buffers as short as possible. But we're not getting a
consistent view anyway, it's just a matter of degree.

Seems like a reasonably large difference in degree whether you have a snapshot
collected in one loop, or you do things like spilling a tuplestore to disk in
between.

Looking at the original discussion when pg_buffercache was introduced [1], the first patch version didn't have the array, but it was added in v2 to avoid holding BufMappingLock for long. But we gave up on getting a consistent snapshot and stopped holding the lock in commit 6e654546fb.

To summarize my current thinking, I think this is fine as committed. I'm not worried about the more "stretched out" snapshot that you get. And it would be nice if we had a better, faster ValuePerCall mode that also worked with FunctionScans, but we don't.

[1] https://www.postgresql.org/message-id/42297D6E.3000505%40coretech.co.nz

- Heikki



Reply via email to