Hi,

On 2022-02-19 18:16:54 -0800, Peter Geoghegan wrote:
> On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <and...@anarazel.de> wrote:
> > How does that cause the endless loop?
>
> Attached is the page image itself, dumped via gdb (and gzip'd). This
> was on recent HEAD (commit 8f388f6f, actually), plus
> 0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No
> defragmenting in pg_surgery, nothing like that.
>
> > It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
> > me. So something needs have changed with your patch?
>
> It doesn't always happen -- only about half the time on my machine.
> Maybe it's timing sensitive?

Ah, I'd only run the tests three times or so, without it happening. Trying a
few more times repro'd it.

It's kind of surprising that this needs this
0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question
of hint bits changing due to lazy_scan_noprune(), which then makes
HeapTupleHeaderIsHotUpdated() have a different return value, preventing the
"If the tuple is DEAD and doesn't chain to anything else" path from being
taken.

> We hit the "goto retry" on offnum 2, which is the first tuple with
> storage (you can see "the ghost" of the tuple from the LP_DEAD item at
> offnum 1, since the page isn't defragmented in pg_surgery). I think
> that this happens because the heap-only tuple at offnum 2 is fully
> DEAD to lazy_scan_prune, but hasn't been recognized as such by
> heap_page_prune. There is no way that they'll ever "agree" on the
> tuple being DEAD right now, because pruning still doesn't assume that
> an orphaned heap-only tuple is fully DEAD.
>
> We can either do that, or we can throw an error concerning corruption
> when heap_page_prune notices orphaned tuples. Neither seems
> particularly appealing. But it definitely makes no sense to allow
> lazy_scan_prune to spin in a futile attempt to reach agreement with
> heap_page_prune about a DEAD tuple really being DEAD.

Yea, this sucks. I think we should go for the rewrite of the
heap_prune_chain() logic. The current approach is just never going to be
robust.

Greetings,

Andres Freund
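
For readers following the thread, the disagreement described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyItem`, `toy_prune()` and `toy_scan()` are hypothetical names invented for illustration, not PostgreSQL structures or functions, and the retry cap exists only so the example terminates. It assumes pruning removes a DEAD heap-only tuple only when it is reached through a chain from a root item, while the scan retries whenever it still sees a DEAD tuple with storage.

```c
#include <stdbool.h>

/*
 * Toy model only. These types and functions are hypothetical and are NOT
 * PostgreSQL's actual data structures or APIs.
 */
typedef struct
{
    bool has_storage;   /* item still points at tuple storage */
    bool heap_only;     /* tuple is a heap-only tuple */
    bool reachable;     /* tuple is reachable from some root item's chain */
    bool dead;          /* visibility check reports the tuple as DEAD */
} ToyItem;

/*
 * Toy pruning pass: a DEAD tuple is removed only if it is not heap-only,
 * or if it is a heap-only tuple reached via a chain. An orphaned heap-only
 * tuple is never removed. Returns the number of tuples removed.
 */
static int
toy_prune(ToyItem *items, int n)
{
    int removed = 0;

    for (int i = 0; i < n; i++)
    {
        if (items[i].has_storage && items[i].dead &&
            (!items[i].heap_only || items[i].reachable))
        {
            items[i].has_storage = false;
            removed++;
        }
    }
    return removed;
}

/*
 * Toy scan: prune, then restart whenever a DEAD tuple with storage is
 * still present. Returns the number of retries that were needed, or -1
 * if max_retries was exceeded (standing in for the endless loop).
 */
static int
toy_scan(ToyItem *items, int n, int max_retries)
{
    int retries = 0;

retry:
    toy_prune(items, n);
    for (int i = 0; i < n; i++)
    {
        if (items[i].has_storage && items[i].dead)
        {
            if (++retries > max_retries)
                return -1;      /* pruning will never remove this tuple */
            goto retry;
        }
    }
    return retries;
}
```

In this model, a page whose second item is a DEAD heap-only tuple with `reachable = false` makes `toy_scan()` hit the cap no matter how large it is, because each retry repeats the same pruning decision. The same tuple with `reachable = true` is removed on the first pass and no retry is needed.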