get_actual_variable_range() scans an index to find actual min/max
values for selectivity estimation. Since this happens during planning,
we can't afford to spend too much time on it. We use
VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've
visited that many heap pages, we give up and fall back to pg_statistic
values. See commit 9c6ad5eaa9 for background information.
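
To recap how that works at the code level, the heart of
get_actual_variable_endpoint() (the helper that does the work for
get_actual_variable_range) looks roughly like this -- a simplified
sketch, with setup and endpoint value extraction elided:

    #define VISITED_PAGES_LIMIT     100

    while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
    {
        BlockNumber block = ItemPointerGetBlockNumber(tid);

        /* Heap pages the VM shows as all-visible need no heap fetch */
        if (!VM_ALL_VISIBLE(heapRel, block, &vmbuffer))
        {
            /* Have to visit the heap to check tuple visibility */
            if (++n_visited_heap_pages > VISITED_PAGES_LIMIT)
                break;      /* give up, fall back on pg_statistic */

            if (!index_fetch_heap(index_scan, tableslot))
                continue;   /* no visible tuple here, try the next TID */
        }

        /* Found a visible extremal tuple -- use its value */
        have_data = true;
        break;
    }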

Recent benchmark results from Mark Callaghan [1] show that the
VISITED_PAGES_LIMIT protection in get_actual_variable_range isn't very
effective once the problem with dead index tuples really starts to get
out of hand. This is expected
with queue-like tables that continually have older records deleted as
newer ones are inserted. Right now this is the Achilles' heel for
Postgres when it runs on Mark's insert benchmark. It's also likely to
be a problem for TPC-C's order table. Tables that look like that are
reasonably common in real world workloads IME.

(My work on this problem is also related to the index prefetching
patch, which makes API revisions that aren't really compatible with
the current VISITED_PAGES_LIMIT design. More on that later.)

Problem
=======

VISITED_PAGES_LIMIT counts heap page visits. But when there are many
index tuples marked LP_DEAD, btgettuple will traverse arbitrarily many
*index* pages before returning even a single tuple to the selfuncs.c
caller. The selfuncs.c heap page counter never gets
a chance to increment. Profiling by Mark shows that we waste quite a
lot of CPU cycles in nbtree functions like _bt_readpage when this
happens. In practice there is no *natural limit* on the amount of work
that we can do during any one call to get_actual_variable_range --
VISITED_PAGES_LIMIT doesn't consider costs related to reading index
leaf pages at all.
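
To see why at the nbtree level: _bt_readpage() skips over items whose
line pointers were already marked LP_DEAD, so a leaf page made up
entirely of dead index tuples contributes nothing to the scan, and we
just step to the next leaf page. Simplifying again (forward scan case):

    for (offnum = minoff; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      iid = PageGetItemId(page, offnum);

        /*
         * Items already marked LP_DEAD never reach the selfuncs.c
         * caller, and so can never be counted against
         * VISITED_PAGES_LIMIT -- no matter how many leaf pages we had
         * to read to get past them
         */
        if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
            continue;

        /* ... otherwise check scan keys and save the item ... */
    }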

The design of get_actual_variable_range specifically aims to set
LP_DEAD bits, with the goal of helping future calls to
get_actual_variable_range avoid heap fetches. But there's an important
sense in which this hurts rather than helps: each index tuple marked
LP_DEAD is one less heap fetch that can be counted against
VISITED_PAGES_LIMIT going forward.
The more LP_DEAD bits we set, the less effective VISITED_PAGES_LIMIT
becomes at bailing out early during future calls to
get_actual_variable_range. Past some tipping point, the cost of
reading index leaf pages inevitably becomes the bottleneck. That
effect seems perverse -- shouldn't setting LP_DEAD bits in
index tuples be strictly a good thing?

I suspect that VISITED_PAGES_LIMIT is fundamentally addressing the
problem in the wrong way. Even without VISITED_PAGES_LIMIT, or
anything like it, it is reasonable to expect that
get_actual_variable_range will require a very small, fixed amount of
work to find an extremal value in the vast majority of cases. After
all, we only need to locate one non-dead extremal index tuple, and
then we're done. If we get anywhere near VISITED_PAGES_LIMIT, then
it's already reasonable to suppose that we're dealing with a
pathological workload -- a workload exactly like the one that Mark's
testing ran into.

ISTM that Mark has demonstrated that VISITED_PAGES_LIMIT doesn't work
particularly well in general. AFAICT there is nothing special about
Mark's complaint -- it's not some special case that the
VISITED_PAGES_LIMIT design didn't directly consider. Any case where
get_actual_variable_range does many heap fetches is likely to look
rather similar. Again, we only need to find one non-dead index tuple!
If we can't find a suitable non-dead tuple on the first index page we
look at, then why should we find one on the second or the third?

Proposed solution
=================

We really should use an approach that makes a *hard guarantee* that
the costs that we pay will not exceed some known threshold (whether
those costs are from scanning index pages or scanning heap pages). If
we completely ignore costs paid by the index AM layer, then the
problem just shifts to that layer before too long.

Rather than counting heap page visits, I propose limiting the scan to
the extremal leaf page only. If the extremal index leaf page yields no
visible tuples, we give up immediately. The latest version of the
index prefetching patch series [2] includes a WIP patch that does just
that.
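
I won't repeat the patch details here, but the general shape is
simple: the scan is told up front that it must not step beyond its
initial leaf page. Purely for illustration (the field name is
hypothetical, and this isn't necessarily how the WIP patch implements
it), the nbtree side can be thought of as doing something like this
when it finishes reading the first leaf page:

    if (so->firstLeafPageOnly)
    {
        /*
         * Caller only cares about the extremal leaf page: report that
         * there are no further pages in either direction, ending the
         * scan here instead of stepping to a sibling leaf page
         */
        so->currPos.moreRight = false;
        so->currPos.moreLeft = false;
    }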

Testing of the patch posted to that other thread today shows that this
approach is much more effective. For example, with a 5 million row
table, the latency profile of running EXPLAIN about 66,000 times works
out as follows:

Baseline (no dead tuples):
  Metric           Master        Patch      Ratio
  ---------- ------------ ------------ ----------
  Avg               0.081        0.081      0.997x
  p95               0.096        0.095      0.993x
  p99               0.108        0.111      1.028x

Unsurprisingly, the profile for the patch looks very similar to that
of master. But when my test script deletes 20% of the tuples in the
table and then repeats the same process, things look very different:

With dead tuples (bulk-deleted tuples, concurrent inserts/deletes):
  Metric           Master        Patch      Ratio
  ---------- ------------ ------------ ----------
  Avg               0.933        0.080      0.085x
  p95               1.167        0.105      0.090x
  p99               1.195        0.133      0.112x

The patch is a little over 10x faster than master now. More
importantly, the patch performs very much like it did the first time
around, before we deleted any index tuples. We have a much more
meaningful guarantee about how bad things can get in the worst case.

I don't think that we need to consider heap visits; just limiting the
scan to a single leaf page seems to be sufficient. I'm open to the
idea of giving some consideration to the number of heap fetches
required, if testing shows that we still need some of that/need to be
cleverer. But I'm sure that we absolutely need to pay attention
to the cost of reading index leaf pages to do well here.

Note on index prefetching patch
===============================

As I touched on briefly already, and went into in detail on the index
prefetching patch thread, the current VISITED_PAGES_LIMIT design is
hard to make work with the API revisions proposed by the index
prefetching patch. A recap of the problems it creates for the index
prefetching patch seems in order:

The index prefetching patch adds a new table_index_getnext_slot()
interface, rather like the current index_getnext_slot -- we need to
use that from selfuncs.c now (not the low-level index_getnext_tid +
index_fetch_heap interfaces). All visibility map lookups happen on the
table AM/heapam side with this new interface, which is a more modular
design. But it leaves selfuncs.c without any way to implement ad-hoc
counting of heap page accesses to enforce VISITED_PAGES_LIMIT; all of
the relevant heapam implementation details are hidden from callers
such as selfuncs.c.
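
Concretely, the selfuncs.c loop under the new interface ends up
looking something like this (signature shown for illustration; see the
prefetching thread for the real definition):

    /* All VM checks and heap fetches happen inside the table AM now */
    while (table_index_getnext_slot(index_scan, indexscandir, tableslot))
    {
        /* The first visible tuple gives us the extremal value */
        have_data = true;
        break;
    }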

I could work around this problem by inventing a new way to enforce
VISITED_PAGES_LIMIT that lives on the table AM side. But that doesn't
seem very appealing. We have to accommodate selfuncs.c's requirements
in lower-level code either way, it seems, so we might as well add that
mechanism to nbtree directly -- especially since it seems to fix the
existing problems with get_actual_variable_range.

Thoughts?

[1] https://smalldatum.blogspot.com/2026/01/cpu-bound-insert-benchmark-vs-postgres.html
[2] https://postgr.es/m/cah2-wznv9_kgqhq1vcw2pkia6qskbgcx5nc_-uxnd6geqas...@mail.gmail.com
--
Peter Geoghegan

