The attached patch series is a completely overhauled version of earlier work on freezing. Related work from the Postgres 15 cycle became commits 0b018fab, f3c15cbe, and 44fa8488.
Recap
=====

The main high level goal of this work is to avoid painful, disruptive antiwraparound autovacuums (and other aggressive VACUUMs) that do way too much "catch up" freezing, all at once, causing significant disruption to production workloads. The patches teach VACUUM to care about how far behind it is on freezing for each table -- the number of unfrozen all-visible pages that have accumulated so far is directly and explicitly kept under control over time.

Unfrozen pages can be seen as debt. There isn't necessarily anything wrong with getting into debt (getting into debt to a small degree is all but inevitable), but debt can be dangerous when it isn't managed carefully. Accumulating large amounts of debt doesn't always end badly, but it does seem to reliably create the *risk* that things will end badly.

Right now, a standard append-only table could easily do *all* freezing in aggressive/antiwraparound VACUUM, without any earlier non-aggressive VACUUM operations triggered by autovacuum_vacuum_insert_threshold doing any freezing at all (unless the user goes out of their way to tune vacuum_freeze_min_age). There is currently no natural limit on the number of unfrozen all-visible pages that can accumulate -- unless you count age(relfrozenxid), the triggering condition for antiwraparound autovacuum. But relfrozenxid age predicts almost nothing about how much freezing is required (or will be required later on).

The overall result is that it often takes far too long for freezing to finally happen, even when the table receives plenty of autovacuums (they all could freeze something, but in practice just don't freeze anything). It's very hard to avoid that through tuning, because what we really care about is something pretty closely related to (if not exactly) the number of unfrozen heap pages in the system. XID age is fundamentally "the wrong unit" here -- the physical cost of freezing is the most important thing, by far.

In short, the goal of the patch series/project is to make autovacuum scheduling much more predictable over time, especially with very large append-only tables. The patches improve the performance stability of VACUUM by managing costs holistically, over time. What happens in one single VACUUM operation is much less important than the behavior of successive VACUUM operations over time.

What's new: freezing/skipping strategies
========================================

This newly overhauled version introduces the concept of per-VACUUM-operation strategies, which we decide on once per VACUUM, at the very start. There are 2 choices to be made at this point (right after we acquire OldestXmin and similar cutoffs):

1) Do we scan all-visible pages, or do we skip them instead? (Added by the second patch; involves a trade-off between eagerness and laziness.)

2) How should we freeze -- eagerly or lazily? (Added by the third patch.)

The strategy-based approach can be thought of as something that blurs the distinction between aggressive and non-aggressive VACUUM, giving VACUUM more freedom to do either more or less work, based on known costs and benefits. This doesn't completely supersede aggressive/antiwraparound VACUUMs, but should make them much rarer with larger tables, where controlling freeze debt actually matters. There is a need to keep laziness and eagerness in balance here. We try to get the benefit of lazy behaviors/strategies, but will still course correct when it doesn't work out.
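To show the shape of that, here is a minimal standalone sketch of the two up-front decisions. The names and the placeholder heuristics are purely illustrative (not the patch's actual code); it assumes a table-size cutoff along the lines of the vacuum_freeze_strategy_threshold GUC described below, plus a count of the extra pages that eager scanning would have to visit:

#include <stdint.h>

typedef uint32_t BlockNumber;   /* as in PostgreSQL */

/* Illustrative names only -- not the patch's actual enums */
typedef enum { VACSKIP_LAZY, VACSKIP_EAGER } VacSkipStrategy;       /* choice 1 */
typedef enum { VACFREEZE_LAZY, VACFREEZE_EAGER } VacFreezeStrategy; /* choice 2 */

typedef struct VacStrategies
{
    VacSkipStrategy   skip;
    VacFreezeStrategy freeze;
} VacStrategies;

/*
 * Decided exactly once per VACUUM, right after OldestXmin and similar
 * cutoffs are acquired, and then committed to for the rest of the
 * operation.
 *
 * rel_pages        -- table size in pages
 * freeze_threshold -- table-size cutoff for eager freezing, in pages
 * extra_scan_pages -- all-visible (but not all-frozen) pages that eager
 *                     scanning would add (where this count comes from is
 *                     covered below)
 * extra_scan_limit -- how many extra pages we're willing to scan to get
 *                     an early relfrozenxid advance (placeholder)
 */
static VacStrategies
choose_strategies(BlockNumber rel_pages, BlockNumber freeze_threshold,
                  BlockNumber extra_scan_pages, BlockNumber extra_scan_limit)
{
    VacStrategies s;

    /* Lazy freezing (much like today) below the cutoff, eager above it */
    s.freeze = (rel_pages >= freeze_threshold) ? VACFREEZE_EAGER : VACFREEZE_LAZY;

    /* Be eager about relfrozenxid when the known extra scan cost is low */
    s.skip = (extra_scan_pages <= extra_scan_limit) ? VACSKIP_EAGER : VACSKIP_LAZY;

    return s;
}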
A new GUC/reloption called vacuum_freeze_strategy_threshold is added to control freezing strategy (it also influences our choice of skipping strategy). It defaults to 4GB, so tables smaller than that cutoff (which are usually the majority of all tables) will continue to freeze in much the same way as today by default. Our current lazy approach to freezing makes sense there, and should be preserved for its own sake.

Compatibility
=============

Structuring the new freezing behavior as an explicit user-configurable strategy is also useful as a bridge between the old and new freezing behaviors. It makes it fairly easy to get the old/current behavior where that's preferred -- which, I must admit, is something that wasn't well thought through last time around.

The vacuum_freeze_strategy_threshold GUC is effectively (though not explicitly) a compatibility option. Users that want something close to the old/current behavior can use the GUC or reloption to more or less opt out of the new freezing behavior, and can do so selectively. The GUC should be easy for users to understand, too -- it's just a table size cutoff.

Skipping pages using a snapshot of the visibility map
=====================================================

We now take a copy of the visibility map at the point that VACUUM begins, and work off of that when skipping, instead of working off of the mutable/authoritative VM -- this is a visibility map snapshot. This new infrastructure helps us to decide on a skipping strategy.

Every non-aggressive VACUUM operation now has a choice to make: which skipping strategy should it use? (This was introduced as item/question #1 a moment ago.) The decision on skipping strategy is a decision about our priorities for this table, at this time: is it more important to advance relfrozenxid early (be eager), or to skip all-visible pages instead (be lazy)? If it's the former, then we must scan every single page that isn't all-frozen according to the VM snapshot (including every all-visible page). If it's the latter, we'll scan exactly 0 all-visible pages. Either way, once a decision has been made, we don't leave much to chance -- we commit.

ISTM that this is the only approach that really makes sense. Fundamentally, we advance relfrozenxid a table at a time, and at most once per VACUUM operation. And for larger tables it's just impossible as a practical matter to have frequent VACUUM operations. We ought to be *somewhat* biased in the direction of advancing relfrozenxid by *some* amount during each VACUUM, even when relfrozenxid isn't all that old right now.

A strategy (whether for skipping or for freezing) is a big, up-front decision -- and there are certain kinds of risks that naturally accompany that approach. The information driving the decision had better be fairly reliable! By using a VM snapshot, we can choose our skipping strategy based on precise information about how many *extra* pages we will have to scan if we go with eager scanning/relfrozenxid advancement. Concurrent activity cannot change what we scan and what we skip, either -- everything is locked in from the start. That seems important to me. It justifies trying to advance relfrozenxid early, just because the added cost of scanning any all-visible pages happens to be low. This is quite a big shift for VACUUM, at least in some ways.
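To make "work off of a copy" a little more concrete, here is a simplified standalone sketch of a VM snapshot, and of how the exact scan cost of each skipping strategy falls out of it before the heap scan even starts. Again, the struct layout and names here are illustrative, not what the patch actually does internally:

#include <stdint.h>

typedef uint32_t BlockNumber;   /* as in PostgreSQL */

#define VM_ALL_VISIBLE  0x01
#define VM_ALL_FROZEN   0x02

/*
 * An immutable per-page copy of the visibility map bits, captured once
 * when VACUUM begins.  All skipping decisions are driven by this copy,
 * so concurrent changes to the authoritative VM can't change which
 * pages end up getting scanned.
 */
typedef struct VMSnapshot
{
    BlockNumber rel_pages;
    uint8_t    *status;     /* one VM_* bitmask per heap page */
} VMSnapshot;

/*
 * Exact scanned_pages under each skipping strategy, known up-front --
 * which is also what makes it possible to report the final
 * scanned_pages before the physical heap scan begins.
 */
static void
vmsnap_scan_costs(const VMSnapshot *vmsnap,
                  BlockNumber *scanned_lazy, BlockNumber *scanned_eager)
{
    BlockNumber lazy = 0,
                eager = 0;

    for (BlockNumber blk = 0; blk < vmsnap->rel_pages; blk++)
    {
        uint8_t bits = vmsnap->status[blk];

        if (bits & VM_ALL_FROZEN)
            continue;               /* skipped under either strategy */
        eager++;                    /* eager scans every not-all-frozen page */
        if (!(bits & VM_ALL_VISIBLE))
            lazy++;                 /* lazy additionally skips all-visible pages */
    }

    *scanned_lazy = lazy;
    *scanned_eager = eager;
}

The difference between the two counts is exactly the number of *extra* pages that eager scanning/relfrozenxid advancement would cost this time around.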
The patch adds a DETAIL to the "starting vacuuming" INFO message shown by VACUUM VERBOSE. The VERBOSE output is already supposed to work as a rudimentary progress indicator (at least when it is run at the database level), so it now shows the final scanned_pages up-front, before the physical scan of the heap even begins:

regression=# vacuum verbose tenk1;
INFO:  vacuuming "regression.public.tenk1"
DETAIL:  total table size is 486 pages, 3 pages (0.62% of total) must be scanned
INFO:  finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 486 remain, 3 scanned (0.62% of total)
*** SNIP ***
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

I included this VERBOSE tweak in the second patch because it became natural with VM snapshots, and not because it felt particularly compelling -- scanned_pages just works like this now (an assertion verifies that our initial scanned_pages is always an exact match to what happened during the physical scan, in fact).

There are many things that VM snapshots might also enable that aren't particularly related to freeze debt. VM snapshotting has the potential to enable more flexible behavior by VACUUM. I'm thinking of things like suspend-and-resume for VACUUM/autovacuum, or even autovacuum scheduling that coordinates autovacuum workers before and during processing by vacuumlazy.c. Locking in scanned_pages up-front avoids the main downside that comes with throttling VACUUM right now: the fact that simply taking our time during VACUUM will tend to increase the number of concurrently modified pages that we end up scanning. These pages are bound to mostly just contain "recently dead" tuples that the ongoing VACUUM can't do much about anyway -- we could dirty a lot more heap pages as a result, for little to no benefit.

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable consequence of the work done by the earlier patches: it makes VACUUM avoid allocating new MultiXactIds (unless that's fundamentally impossible, as in a VACUUM FREEZE). With just the first 3 patches applied, VACUUM will naively process xmax using a cutoff XID that comes from OldestXmin (and not FreezeLimit, which is how it works on HEAD). But with the fourth patch in place, VACUUM applies an XID cutoff of either OldestXmin or FreezeLimit selectively, based on the costs and benefits for any given xmax. Just like in lazy_scan_noprune, the low-level xmax-freezing code can pick and choose as it goes, within certain reasonable constraints. We must accept an older final relfrozenxid/relminmxid value for the rel's authoritative pg_class tuple as a consequence of avoiding xmax processing, of course, but that shouldn't matter at all (it's definitely better than the alternative).
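To illustrate the general idea (and only that -- this is a heavily simplified standalone model, with names and structure of my own, not the patch's actual code paths): when an xmax is a MultiXactId that can't be processed against the OldestXmin-based cutoff without allocating a brand new MultiXactId, we can decline to process it now, and simply remember the older value for the final relminmxid:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;   /* as in PostgreSQL */

typedef struct XmaxDecision
{
    bool        process_now;        /* freeze/remove this xmax in this VACUUM? */
    MultiXactId tracked_relminmxid; /* what we can later claim in pg_class */
} XmaxDecision;

/*
 * Simplified per-xmax cost/benefit decision.  "would_allocate_new_multi"
 * stands in for the low-level check of whether being eager here would
 * force a brand new MultiXactId to be allocated.
 */
static XmaxDecision
decide_multixact_xmax(MultiXactId xmax,
                      bool would_allocate_new_multi,
                      MultiXactId tracked_relminmxid)
{
    XmaxDecision d;

    if (!would_allocate_new_multi)
    {
        /* Eager processing is cheap enough -- use the OldestXmin-based cutoff */
        d.process_now = true;
        d.tracked_relminmxid = tracked_relminmxid;
    }
    else
    {
        /*
         * Avoid allocating a new MultiXactId: leave xmax in place for now,
         * and accept an older final relminmxid as the price.
         */
        d.process_now = false;
        d.tracked_relminmxid = (xmax < tracked_relminmxid) ? xmax : tracked_relminmxid;
    }

    return d;
}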
Reducing the WAL space overhead of freezing
===========================================

Not included in this new v1 are other patches that control the overhead of added freezing -- my focus since joining AWS has been on getting these more strategic patches into shape, and on telling the right story about what I'm trying to do here. I'm going to say a little about the patches that I have in the pipeline here, though. Getting the low-level/mechanical overhead of freezing under control will probably require a few complementary techniques, not just high-level strategies (though the strategy stuff is the most important piece).

The really interesting omitted-in-v1 patch adds deduplication of xl_heap_freeze_page WAL records. This reduces the space overhead of WAL records used to freeze by ~5x in most cases. It works in the obvious way: we just store the 12 byte freeze plans that appear in each xl_heap_freeze_page record only once, and then store an array of item offset numbers for each entry (rather than naively storing a full 12 bytes per tuple frozen per page-level WAL record). This means that we only need an "extra" ~2 bytes of WAL space per "extra" tuple frozen (2 bytes for an OffsetNumber) once we decide to freeze something on the same page. The *marginal* cost can be much lower than it is today, which makes page-based batching of freezing much more compelling IMV.
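To show the layout those numbers imply, here is a rough standalone sketch of a deduplicated page-level freeze record (the field names are placeholders; the real record is in the omitted patch):

#include <stdint.h>

typedef uint32_t TransactionId;  /* as in PostgreSQL */
typedef uint16_t OffsetNumber;

/*
 * One deduplicated "freeze plan": stored once per distinct plan, rather
 * than once per frozen tuple.
 */
typedef struct FreezePlan
{
    TransactionId xmax;
    uint16_t      t_infomask2;
    uint16_t      t_infomask;
    uint8_t       frz_flags;
    uint16_t      ntuples;       /* how many offsets follow for this plan */
} FreezePlan;

/*
 * Conceptual body of one page-level freeze record:
 *
 *     FreezePlan   plans[nplans];                -- deduplicated, ~12 bytes each
 *     OffsetNumber offsets[total tuples frozen]; -- 2 bytes per tuple
 *
 * Rough arithmetic: freezing 100 tuples on one page that all share a
 * single plan costs about 12 + 100 * 2 = 212 bytes of record payload,
 * versus about 12 * 100 = 1200 bytes when every tuple carries its own
 * full plan -- which is where the ~5x figure comes from.
 */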
Thoughts?
-- 
Peter Geoghegan

Attachments:
v1-0001-Add-page-level-freezing-to-VACUUM.patch
v1-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch
v1-0003-Add-eager-freezing-strategy-to-VACUUM.patch
v1-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch