On 3/26/26 05:21, Ashutosh Bapat wrote:
> On Wed, Mar 25, 2026 at 10:19 PM Masahiko Sawada <[email protected]> 
> wrote:
>>
>> On Tue, Mar 24, 2026 at 11:47 PM Lukas Fittl <[email protected]> wrote:
>>>
>>> Hi Ashutosh,
>>>
>>> On Tue, Mar 24, 2026 at 11:24 PM Ashutosh Bapat
>>> <[email protected]> wrote:
>>>> I know we already have a couple of hand-aggregation functions but I am
>>>> hesitant to add more of these. Question is where do we stop? For
>>>> example, the current function is useless if someone wants to find the
>>>> parts of a relation which are hot since it doesn't include page
>>>> numbers. Do we write another function for the same? Or we add page
>>>> numbers to this function and then there's hardly any aggregation
>>>> happening. What if somebody wanted to perform an aggregation more
>>>> complex than just count() like average number of buffers per relation
>>>> or distribution of relation buffers in the cache, do they write
>>>> separate functions?
>>>
>>> I think the problem this solves, which is a very common question I
>>> hear from end users, is "how much of this table/index is in cache" and
>>> "was our query slow because the cache contents changed?".
>>>
>>> It can't provide a perfect answer to all questions regarding what's in
>>> the cache (i.e. it won't tell you which part of the table is cached),
>>> but it's in line with other statistics we already provide in
>>> pg_stat_user_tables etc., which are all aggregate counts, not further
>>> breakdowns.
>>>
>>> It's also a reasonable compromise on providing something usable that
>>> can be shown on dashboards, based on my experience collecting this
>>> information with the existing methods on small production systems
>>> over the last ~1.5 years.
>>
>> Regarding the proposed statistics, I find them reasonably useful for
>> many users. I'm not sure we need to draw a strict line on what belongs
>> in the module. If a proposed function does exactly what most
>> pg_buffercache users want or are already writing themselves, that is
>> good enough motivation to include it.
>>
>> I think pg_visibility is a good precedent here. In that module, we
>> have both pg_visibility_map() and pg_visibility_map_summary(), even
>> though we can retrieve the exact same results as the latter by simply
>> using the former:
>>
>> select sum(all_visible::int), sum(all_frozen::int) from
>> pg_visibility_map('test') ;
>>
> 
> A summary may still be ok, but this proposal goes a bit farther:
> it's grouping by one subset, which should really be done with GROUP
> BY in SQL.
> 
> I am afraid that at some point we will start finding all of these to
> be a maintenance burden. At that point, removing them will become a
> real pain for backward compatibility reasons. For example:
> 1. The proposed function is going to add one more test to an already
> huge testing exercise for shared buffers resizing.
> 2. If we change the way the buffer cache is managed, e.g. use a
> tree-based cache instead of hash + array, each of the functions that
> traverses the buffer cache array is going to add work - adjusting it
> to the new data structure - and make a hard project even harder. In
> this case we have other ways to get the summary, so the code-level
> scan of the buffer cache is entirely avoidable.
> 
> If I am the only one opposing it, and there are more senior
> contributors in favour of adding this function, we can accept it.
> 

I understand this argument - we have SQL, which allows us to process the
data in a flexible way, without hard-coding all interesting groupings.
The question is whether this particular grouping is special enough to
warrant a custom *faster* function.

The main argument here seems to be performance, and the initial
message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
128MB of shared buffers - unless I misunderstood which config it used.

I gave it a try on an Azure VM with 32GB of shared buffers, to make it
a bit more realistic, and my timings are 10ms vs. 700ms. But I also
wonder whether the original timings really were from a cluster with
128MB, because for me that setup shows 0.3ms vs. 3ms (so an order of
magnitude faster than what was reported). But I suppose that's also
hardware specific.

Nevertheless, it is much faster. I haven't profiled this but I assume
it's thanks to not having to write the entries into a tuplestore (and
possibly into a tempfile).

But is it actually needed / worth it? I wonder what timings Lukas
observes when running this on larger clusters, because in a later
email he says:

   ... we currently run this on a 10 minute schedule when enabled, and
   that seems to work in terms of understanding large swings in cache
   contents.

I'm all in for optimizing stuff, but if you're running a monitoring task
every 10 minutes, does it matter if it's running for 1 or 5 seconds? I
find that a bit hard to believe.

Let's assume it's worth it. I wonder what similar summaries might be
interesting for users. I'd probably want to see a per-database summary,
especially on a shared / multi-tenant cluster. But AFAICS I can
calculate that from the pg_buffercache_relations() result, except that
I'll have to recalculate the usagecount.
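For reference, the per-database summary I have in mind can already be
computed from the plain pg_buffercache view with regular SQL
aggregation - just without the custom function's speedup. A rough
sketch (reldatabase is NULL for unused buffers, hence the outer join):

```sql
-- Per-database buffer counts from the existing view
SELECT d.datname, count(*) AS buffers
  FROM pg_buffercache b
  LEFT JOIN pg_database d ON b.reldatabase = d.oid
 GROUP BY d.datname
 ORDER BY buffers DESC;
```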

I don't have a clear opinion on whether we should do this. I kinda
doubt it's a significant maintenance burden. It'd add one more place
the patch for online resizing of shared buffers needs to worry about,
but surely that should not be very difficult, considering there are ~5
other places in this very extension doing this already?

One thing we lose by doing the ad hoc aggregation (instead of relying
on the regular SQL aggregate functions) is the memory limit. There's a
simple in-memory hash table, with no spilling to disk etc. The plain
pg_buffercache view does not have this issue, because the tuplestore
will spill to disk after hitting work_mem. Simplehash won't.

The entries are ~48B, so there would need to be buffers for ~100k
(relfilenode,forknum) combinations to overflow 4MB. That's not very
common, but I've seen systems with more relations than this. It would
be good to show some numbers demonstrating that it's not an issue.
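As a back-of-envelope check of that figure (assuming the ~48B entry
size above):

```sql
-- Entries fitting in 4MB at ~48B each (integer division)
SELECT (4 * 1024 * 1024) / 48 AS max_entries;  -- 87381, i.e. roughly ~100k
```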


A couple minor comments about the code:

1) Isn't this check unnecessary? All entries should have buffers > 0.

    if (entry->buffers == 0)
        continue;

2) Shouldn't this check BM_TAG_VALID too? Or is BM_VALID enough to look
at the bufHdr->tag?

    /* Skip unused/invalid buffers */
    if (!(buf_state & BM_VALID))
        continue;

3) I think the "buffers" argument should be renamed to "buffers_unused"
for consistency with pg_buffercache_summary.


Overall, I'm -0.1 on this. I'm not opposed to doing this, but I'm also
not quite convinced it's worth it.


regards

-- 
Tomas Vondra
