On Fri, Mar 27, 2026 at 3:58 PM Tomas Vondra <[email protected]> wrote:
>
> On 3/26/26 05:21, Ashutosh Bapat wrote:
> > On Wed, Mar 25, 2026 at 10:19 PM Masahiko Sawada <[email protected]>
> > wrote:
> >>
> >> On Tue, Mar 24, 2026 at 11:47 PM Lukas Fittl <[email protected]> wrote:
> >>>
> >>> Hi Ashutosh,
> >>>
> >>> On Tue, Mar 24, 2026 at 11:24 PM Ashutosh Bapat
> >>> <[email protected]> wrote:
> >>>> I know we already have a couple of hand-aggregation functions but I am
> >>>> hesitant to add more of these. Question is where do we stop? For
> >>>> example, the current function is useless if someone wants to find the
> >>>> parts of a relation which are hot since it doesn't include page
> >>>> numbers. Do we write another function for the same? Or we add page
> >>>> numbers to this function and then there's hardly any aggregation
> >>>> happening. What if somebody wanted to perform an aggregation more
> >>>> complex than just count() like average number of buffers per relation
> >>>> or distribution of relation buffers in the cache, do they write
> >>>> separate functions?
> >>>
> >>> I think the problem this solves for, which is a very common question I
> >>> hear from end users, is "how much of this table/index is in cache" and
> >>> "was our query slow because the cache contents changed?".
> >>>
> >>> It can't provide a perfect answer to all questions regarding what's in
> >>> the cache (i.e. it won't tell you which part of the table is cached),
> >>> but its in line with other statistics we do already provide in
> >>> pg_stat_user_tables etc., which are all aggregate counts, not further
> >>> breakdowns.
> >>>
> >>> Its also a reasonable compromise on providing something usable that
> >>> can be shown on dashboards, as I've seen in collecting this
> >>> information using the existing methods from small production systems
> >>> in practice over the last ~1.5 years.
> >>
> >> Regarding the proposed statistics, I find them reasonably useful for
> >> many users.
> >> I'm not sure we need to draw a strict line on what belongs
> >> in the module. If a proposed function does exactly what most
> >> pg_buffercache users want or are already writing themselves, that is
> >> good enough motivation to include it.
> >>
> >> I think pg_visibility is a good precedent here. In that module, we
> >> have both pg_visibility_map() and pg_visibility_map_summary(), even
> >> though we can retrieve the exact same results as the latter by simply
> >> using the former:
> >>
> >> select sum(all_visible::int), sum(all_frozen::int) from
> >> pg_visibility_map('test') ;
> >>
> >
> > A summary may still be ok, but this proposal is going a bit farther,
> > it's grouping by one subset which should really be done by GROUP BY in
> > SQL. And I do
> >
> > I am afraid that at some point, we will start finding all of these to
> > be a maintenance burden. At that point, removing them will become a
> > real pain for the backward compatibility reason. For example
> > 1. The proposed function is going to add one more test to an already
> > huge testing exercise for shared buffers resizing.
> > 2. If we change the way to manage buffer cache e.g. use a tree based
> > cache instead of hash + array cache, each of the functions which
> > traverses the buffer cache array is going to add work - adjusting it
> > to the new data structure - and make a hard project even harder. In
> > this case we have other ways to get the summary, so the code level
> > scan of buffer cache is entirely avoidable.
> >
> > If I am the only one opposing it, and there are more senior
> > contributors in favour of adding this function, we can accept it.
>
> I understand this argument - we have SQL, which allows us to process the
> data in a flexible way, without hard-coding all interesting groupings.
> The question is whether this particular grouping is special enough to
> warrant a custom *faster* function.
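For reference, the plain-SQL aggregation being discussed here is essentially
the example query from the pg_buffercache documentation (the LIMIT and the
descending ordering are just illustrative):

```sql
-- Top 10 relations in the current database by number of cached buffers.
SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY 3 DESC
LIMIT 10;
```

This is the flexible-but-slower path: every buffer header is materialized
into the view's result set first, and the grouping happens in the executor.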
>
> The main argument here seems to be the performance, and the initial
> message demonstrates a 10x speedup (2ms vs. 20ms) on a cluster with
> 128MB shared buffers. Unless I misunderstood what config it uses.
>
> I gave it a try on an azure VM with 32GB shared buffers, to make it a
> bit more realistic, and my timings are 10ms vs. 700ms. But I also wonder
> if the original timings really were from a cluster with 128MB, because
> for me that shows 0.3ms vs. 3ms (so an order of magnitude faster than
> what was reported). But I suppose that's also hw specific.
>
> Nevertheless, it is much faster. I haven't profiled this but I assume
> it's thanks to not having to write the entries into a tuplestore (and
> possibly into a tempfile).
>
> But is it actually needed / worth it? I wonder what timings does Lukas
> observe when running this on larger clusters. Because in a later email
> he says:
>
>   ... we currently run this on a 10 minute schedule when enabled, and
>   that seems to work in terms of understanding large swings in cache
>   contents.
>
> I'm all in for optimizing stuff, but if you're running a monitoring task
> every 10 minutes, does it matter if it's running for 1 or 5 seconds? I
> find that a bit hard to believe.
I imagined such a query is just one of many monitoring queries running
concurrently, so the cumulative overhead can still matter.

> I don't have clear opinion if we should do this. I kinda doubt it's a
> significant maintenance burden. It'd add one more place the patch for
> on-line resizing of shared buffers needs to worry about. Surely that
> should not be very difficult, considering there are ~5 other places in
> this very extension doing this already?

Yeah, I've not looked at the online shared buffer resizing patch, but I
hope that the patch somewhat abstracts the access to shared buffers that
might be being resized, so that we don't need to worry about the complex
parts when writing code that accesses shared buffers.

> One thing we lose by doing ad hoc aggregation (instead of just relying
> on the regular SQL aggregation operators) is lack of memory limit.
> There's a simple in-memory hash table, no spilling to disk etc. The
> simple pg_buffercache view does not have this issue, because the
> tuplestore will spill to disk after hitting work_mem. Simplehash won't.
>
> The entries are ~48B, so there would need to be buffers for ~100k
> (relfilenode,forknum) combinations to overflow 4MB. It's not very
> common, but I've seen systems with more relations that this. Would be
> good to show some numbers showing it's not an issue.

Good point. I agree that we should not introduce the function in a way
where there is a risk of using excessive memory while not respecting
work_mem or other GUC parameters.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
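P.S. A quick back-of-the-envelope check of the "~48B / 4MB / ~100k" figures
quoted above. This takes the ~48-byte entry size at face value and treats
4MB as a flat byte budget, ignoring simplehash's load factor and
power-of-two sizing, so it is a lower bound on the real memory use:

```python
# Sanity-check the quoted numbers: ~48B per hash entry, 4MB budget.
# Both constants are taken from the email above; real simplehash memory
# use would be larger (load factor, power-of-two table sizing).
ENTRY_BYTES = 48
BUDGET_BYTES = 4 * 1024 * 1024  # 4MB

# How many (relfilenode, forknum) entries fit in a flat 4MB budget?
print(BUDGET_BYTES // ENTRY_BYTES)  # -> 87381

# Memory needed for ~100k distinct combinations:
print(100_000 * ENTRY_BYTES)        # -> 4800000 bytes, i.e. ~4.6MB
```

So roughly 87k entries already fill 4MB even under these optimistic
assumptions, which is consistent with the claim that a system with ~100k
(relfilenode, forknum) combinations would exceed it.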
