On Wed, 21 Feb 2024 at 12:37, Michail Nikolaev <michail.nikol...@gmail.com> wrote: > > Hi! > > > How do you suppose this would work differently from a long-lived > > normal snapshot, which is how it works right now? > > Difference in the ability to take new visibility snapshot periodically > during the second phase with rechecking visibility of tuple according > to the "reference" snapshot (which is taken only once like now). > It is the approach from (1) but with a workaround for the issues > caused by heap_page_prune_opt. > > > Would it be exclusively for that relation? > Yes, only for that affected relation. Other relations are unaffected.
I suppose this could work. We'd also need to be very sure that the toast relation isn't cleaned up either: Even though that's currently DELETE+INSERT only and can't apply HOT, it would be an issue if we couldn't find the TOAST data of a deleted for everyone (but visible to us) tuple. Note that disabling cleanup for a relation will also disable cleanup of tuple versions in that table that are not used for the R/CIC snapshots, and that'd be an issue, too. > > How would this be integrated with e.g. heap_page_prune_opt? > Probably by some flag in RelationData, but not sure here yet. > > If the idea looks sane, I could try to extend my POC - it should be > not too hard, likely (I already have tests to make sure it is > correct). I'm not a fan of this approach. Changing visibility and cleanup semantics to only benefit R/CIC sounds like a pain to work with in essentially all visibility-related code. I'd much rather have to deal with another index AM, even if it takes more time: the changes in semantics will be limited to a new plug in the index AM system and a behaviour change in R/CIC, rather than behaviour that changes in all visibility-checking code. But regardless of second scan snapshots, I think we can worry about that part at a later moment: The first scan phase is usually the most expensive and takes the most time of all phases that hold snapshots, and in the above discussion we agreed that we can already reduce the time that a snapshot is held during that phase significantly. Sure, it isn't great that we have to scan the table again with only a single snapshot, but generally phase 2 doesn't have that much to do (except when BRIN indexes are involved) so this is likely less of an issue. And even if it is, we would still have reduced the number of long-lived snapshots by half. -Matthias