get_actual_variable_range() scans an index to find actual min/max values for selectivity estimation. Since this happens during planning, we can't afford to spend too much time on it. We use VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've visited that many heap pages, we give up and fall back to pg_statistic values. See commit 9c6ad5eaa9 for background information.
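For anyone who doesn't have this code paged in, here's a simplified sketch of the existing approach (not the verbatim selfuncs.c code -- the function and variable names are mine, and various details are elided). The point to notice is that only distinct heap pages that we're forced to visit count against the limit:

#include "postgres.h"

#include "access/genam.h"
#include "access/relscan.h"
#include "access/visibilitymap.h"
#include "executor/tuptable.h"

#define VISITED_PAGES_LIMIT     100

/*
 * Simplified sketch: look for a visible extremal tuple, counting only the
 * distinct heap pages that we're forced to visit.  The index leaf pages
 * read inside index_getnext_tid() are never accounted for.
 */
static bool
fetch_extremal_value_sketch(Relation heapRel, IndexScanDesc index_scan,
                            ScanDirection indexscandir,
                            TupleTableSlot *tableslot, Buffer *vmbuffer)
{
    ItemPointer tid;
    BlockNumber last_heap_page = InvalidBlockNumber;
    int         n_visited_heap_pages = 0;

    while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
    {
        BlockNumber block = ItemPointerGetBlockNumber(tid);

        if (!VM_ALL_VISIBLE(heapRel, block, vmbuffer))
        {
            /* Heap page not all-visible: must check visibility there */
            if (!index_fetch_heap(index_scan, tableslot))
            {
                /* No visible tuple; charge this heap page and keep going */
                if (block != last_heap_page)
                {
                    last_heap_page = block;
                    if (++n_visited_heap_pages > VISITED_PAGES_LIMIT)
                        return false;   /* give up, use pg_statistic */
                }
                continue;
            }
        }

        return true;            /* found a usable extremal value */
    }

    return false;               /* scan exhausted */
}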
Recent benchmark results from Mark Callaghan [1] show that get_actual_variable_range isn't very effective once the problem with dead index tuples really starts to get out of hand. This is expected with queue-like tables that continually have older records deleted as newer ones are inserted. Right now this is the Achilles' heel for Postgres when it runs on Mark's insert benchmark. It's also likely to be a problem for TPC-C's order table. Tables that look like that are reasonably common in real world workloads IME.

(My work on this problem is also related to the index prefetching patch, which makes API revisions that aren't really compatible with the current VISITED_PAGES_LIMIT design. More on that later.)

Problem
=======

VISITED_PAGES_LIMIT counts heap page visits. But when there are many index tuples marked LP_DEAD, btgettuple will traverse indefinitely many *index* pages before returning even a single tuple to the selfuncs.c caller. The selfuncs.c heap page counter never gets a chance to increment. Profiling by Mark shows that we waste quite a lot of CPU cycles in nbtree functions like _bt_readpage when this happens.

In practice there is no *natural limit* on the amount of work that we can do during any one call to get_actual_variable_range -- VISITED_PAGES_LIMIT doesn't consider costs related to reading index leaf pages at all.
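To make the asymmetry concrete, here's roughly where the unbounded work hides (schematic only, written from memory -- not actual nbtree code):

/*
 * A single index_getnext_tid() call can read arbitrarily many index leaf
 * pages before returning, because items whose LP_DEAD bit is set are
 * skipped entirely inside the index AM:
 *
 * index_getnext_tid()
 *   -> btgettuple()
 *        -> _bt_first() / _bt_next()
 *             -> _bt_readpage()   skips LP_DEAD items; may find nothing
 *             -> _bt_steppage()   steps to the next leaf page, retries
 *             -> ... repeats for as many leaf pages as it takes ...
 *
 * Only when a TID is finally returned does the selfuncs.c caller get a
 * chance to bump its heap page counter -- so VISITED_PAGES_LIMIT never
 * constrains any of the index page reads shown here.
 */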
The design of get_actual_variable_range specifically aims to set LP_DEAD bits, with the goal of helping future calls to get_actual_variable_range avoid heap fetches. But there's an important sense in which this hurts rather than helps: each LP_DEAD tuple is one less heap visit that can be counted against VISITED_PAGES_LIMIT going forward. The more LP_DEAD bits we set, the less effective VISITED_PAGES_LIMIT becomes at bailing out early during future calls to get_actual_variable_range. Inevitably, the cost of reading index leaf pages becomes the bottleneck -- at least past some tipping point.

That effect seems perverse -- shouldn't setting LP_DEAD bits in index tuples be strictly a good thing? I suspect that VISITED_PAGES_LIMIT is fundamentally addressing the problem in the wrong way.

Even without VISITED_PAGES_LIMIT, or anything like it, it is reasonable to expect that get_actual_variable_range will require a very small, fixed amount of work to find an extremal value in the vast majority of cases. After all, we only need to locate one non-dead extremal index tuple, and then we're done. If we get anywhere near VISITED_PAGES_LIMIT, then it's already reasonable to suppose that we're dealing with a pathological workload -- a workload exactly like the one that Mark's testing ran into.

ISTM that Mark has demonstrated that VISITED_PAGES_LIMIT doesn't work particularly well in general. AFAICT there is nothing special about Mark's complaint -- it's not some special case that the VISITED_PAGES_LIMIT design didn't directly consider. Any case where get_actual_variable_range does many heap fetches is likely to look rather similar. Again, we only need to find one non-dead index tuple! If we can't find a suitable non-dead tuple on the first index page we look at, then why should we find one on the second or the third?

Proposed solution
=================

We really should use an approach that makes a *hard guarantee* that the costs that we pay will not exceed some known threshold (whether those costs come from scanning index pages or from scanning heap pages). If we completely ignore costs paid by the index AM layer, then the problem just shifts to that layer before too long.

Rather than counting heap page visits, I propose limiting the scan to the extremal leaf page only. If the extremal index leaf page yields no visible tuples, we give up immediately. The latest version of the index prefetching patch series [2] includes a WIP patch that does just that. Testing of the patch posted to that other thread today shows that this approach is much more effective.

For example, with a 5 million row table, the latency profile of running EXPLAIN about 66,000 times works out as follows (with no dead tuples):

Baseline (no dead tuples):

Metric     Master       Patch        Ratio
---------- ------------ ------------ ----------
Avg        0.081        0.081        0.997x
p95        0.096        0.095        0.993x
p99        0.108        0.111        1.028x

Unsurprisingly, the profile for the patch looks very similar to that of master. But when my test script deletes 20% of the tuples in the table, and repeats the same process over again, things look very different:

Main (with bulk-deleted tuples, concurrent inserts/deletes):

Metric     Master       Patch        Ratio
---------- ------------ ------------ ----------
Avg        0.933        0.080        0.085x
p95        1.167        0.105        0.090x
p99        1.195        0.133        0.112x

The patch is a little over 10x faster than master now. More importantly, the patch performs very much like it did the first time around, before we deleted any index tuples. We have a much more meaningful guarantee about how bad things can get in the worst case.

I don't think that we need to consider heap visits; just limiting the scan to a single leaf page seems to be sufficient. I'm open to the idea of giving some consideration to the number of heap fetches required, if testing shows that we still need some of that/need to be cleverer. But I'm sure that we absolutely need to pay attention to the cost of reading index leaf pages to do well here.

Note on index prefetching patch
===============================

As I touched on briefly already, and went into in detail on the index prefetching patch thread, the current VISITED_PAGES_LIMIT design is hard to make work with the API revisions proposed by the index prefetching patch. A recap on the problems it creates for the index prefetching patch seems in order:

The index prefetching patch adds a new table_index_getnext_slot() interface, rather like the current index_getnext_slot() -- we need to use that from selfuncs.c now (not the low-level index_getnext_tid + index_fetch_heap interfaces). All visibility map lookups happen on the table AM/heapam side with this new interface, which is a more modular design. But it leaves selfuncs.c without any way to implement ad-hoc counting of heap page accesses to enforce VISITED_PAGES_LIMIT; all of the relevant heapam implementation details are hidden from callers such as selfuncs.c.

I could work around this problem by inventing a new way to enforce VISITED_PAGES_LIMIT that lives on the table AM side. But that doesn't seem very appealing. We have to accommodate selfuncs.c's requirements in lower-level code either way, it seems, so we might as well add that mechanism to nbtree directly, since doing so also seems to fix the existing problems with get_actual_variable_range.
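To make that a bit more concrete, here's roughly what the selfuncs.c caller could look like if the lower layers offered such a mechanism. This is purely illustrative -- the xs_single_leaf_page field is invented for the sake of the sketch, and the actual WIP patch in [2] may well express this differently (e.g. through the new table_index_getnext_slot() interface):

/*
 * Hypothetical caller-side sketch only.  "xs_single_leaf_page" stands in
 * for whatever mechanism ends up telling the index AM to give up rather
 * than step past the extremal leaf page.
 */
static bool
get_extremal_value_sketch(IndexScanDesc index_scan,
                          ScanDirection indexscandir,
                          TupleTableSlot *tableslot)
{
    index_scan->xs_single_leaf_page = true;     /* hypothetical field */

    while (index_getnext_tid(index_scan, indexscandir) != NULL)
    {
        if (index_fetch_heap(index_scan, tableslot))
            return true;        /* found a visible extremal tuple */

        /* Dead to our snapshot; try the next TID from the same leaf page */
    }

    /*
     * Either the index is empty, or the extremal leaf page had nothing
     * visible and the index AM declined to step to a second page.  Either
     * way, fall back to pg_statistic -- with a hard bound on the total
     * amount of work performed.
     */
    return false;
}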
Thoughts?

[1] https://smalldatum.blogspot.com/2026/01/cpu-bound-insert-benchmark-vs-postgres.html
[2] https://postgr.es/m/cah2-wznv9_kgqhq1vcw2pkia6qskbgcx5nc_-uxnd6geqas...@mail.gmail.com

--
Peter Geoghegan