Recently, I've been thinking a lot about the interlocking protocol that prevents wrong answers during index-only scans, even with concurrent TID recycling (since it is relevant to the index prefetching work). I'm referring to the way index-only scans generally hold a buffer pin on their scan's current leaf page position, which will conflict with the cleanup locks that index vacuuming acquires on every leaf page.
There are historical (and current) bugs that share the same basic shape. There are live bugs in GiST and SP-GiST index-only scans [1]. A similar bug also affected bitmap scans, which previously used information from the visibility map as an optimization (before we fixed that bug) [2].
Unfortunately, there appears to be yet another bug of that general nature. The bug in question affects nbtree index-only scans running during hot standby; these scans can see "phantom" resurrected rows when VACUUM recycles stub LP_DEAD line pointers in heap pages sooner than is safe. The LP_DEAD stubs are needed as tombstones, but VACUUM can sometimes win the race and mark them LP_UNUSED prematurely -- also setting the relevant heap page all-visible in the VM. In other words, this bug's general symptoms match those of the other bugs I mentioned.
This is only possible because the standby won't acquire cleanup locks on *every* index page -- unlike during original execution. It will only cleanup lock whatever index pages actually had one or more index tuples removed during VACUUM, which isn't quite good enough. In other words, the rationale for removing the "pin scan" logic in commits f65b94f6, 3e4b7d87, and 687f2cd7 was subtly flawed in that it didn't consider index-only scans, which are legitimately a special case.
Attached are 2 patches, both intended to show the general nature of the problem. The first patch is a repro written by Claude Code at my direction; there are many tedious and fiddly details involved that aren't worth discussing now. Multiple test cases show wrong answers, allowing the bug to manifest in several different ways (delete+commit vs insert+abort, page split vs index deletion). The second patch resurrects the old "pin scan" logic into modern nbtree, making the failing tests pass, and confirming my understanding of the problem.
Neither patch is committable. The pin scan mechanism performed terribly, and I cannot countenance actually bringing it back now.
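The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL code and not taken from the attached patches: `ToyState`, `vacuum_can_run`, and `run` are hypothetical names, the model has one heap page and one leaf page, and it covers just one plausible interleaving of the "index deletion" variant (the index entry for the dead item is removed opportunistically before VACUUM reaches the leaf page, so VACUUM itself deletes nothing there).

```python
from dataclasses import dataclass, field

LP_NORMAL, LP_DEAD, LP_UNUSED = "normal", "dead", "unused"


@dataclass
class ToyState:
    # One heap page: line pointer offset -> state, plus its visibility-map bit.
    heap: dict = field(default_factory=lambda: {1: LP_NORMAL, 2: LP_DEAD})
    all_visible: bool = False
    # One index leaf page: (key, heap offset) pairs, plus a pin count.
    leaf: list = field(default_factory=lambda: [("a", 1), ("b", 2)])
    pins: int = 0


def vacuum_can_run(state, vacuum_deleted_from_leaf, lock_every_page):
    """A cleanup lock on the leaf page conflicts with any pin held on it."""
    needs_cleanup_lock = lock_every_page or vacuum_deleted_from_leaf
    return not (needs_cleanup_lock and state.pins > 0)


def run(lock_every_page):
    """Return the keys an index-only scan reports in this interleaving.

    lock_every_page=True models VACUUM cleanup-locking every leaf page;
    lock_every_page=False models locking only pages VACUUM deleted from.
    """
    s = ToyState()

    # 1. The scan copies the leaf page's items locally and keeps its pin.
    cached = list(s.leaf)
    s.pins += 1

    # 2. Opportunistic index deletion removes the entry that points at the
    #    LP_DEAD item. In this model it does not need a cleanup lock.
    s.leaf = [(k, off) for k, off in s.leaf if s.heap[off] != LP_DEAD]

    vacuum_done = False

    def try_vacuum():
        nonlocal vacuum_done
        # VACUUM finds nothing to delete in the leaf page (step 2 got there first).
        if vacuum_done or not vacuum_can_run(s, False, lock_every_page):
            return
        for off, st in list(s.heap.items()):
            if st == LP_DEAD:
                s.heap[off] = LP_UNUSED  # the tombstone is recycled
        s.all_visible = True
        vacuum_done = True

    # 3. VACUUM tries to run while the scan still holds its pin.
    try_vacuum()

    # 4. The scan processes its cached items: all-visible means "trust the
    #    index entry"; otherwise check the heap item.
    results = []
    for key, off in cached:
        if s.all_visible or s.heap[off] == LP_NORMAL:
            results.append(key)

    # 5. The scan leaves the page; a blocked VACUUM can now proceed.
    s.pins -= 1
    try_vacuum()
    return results
```

Under these assumptions, `run(lock_every_page=True)` returns `['a']`, because VACUUM waits for the pin and the scan still sees the LP_DEAD tombstone. `run(lock_every_page=False)` returns `['a', 'b']`, where `'b'` is the phantom row.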
I haven't yet given much thought to how we can fix this bug without causing more harm than good. The rationale for removing the "pin scan" logic was *almost* correct back in 2016; we simply failed to consider how index-only scans are a special case (which, crucially, wasn't documented anywhere in 2016, and still isn't today).
[1] https://postgr.es/m/CAH2-Wz=jjinl9fch8c1l-guh15f4wftwub2x+_nucngcddc...@mail.gmail.com
[2] Fixed by April 2025 commit 459e7bf8
-- Peter Geoghegan
v1-0001-Add-a-hot-standby-index-only-scan-TID-recycling-r.patch
v1-0002-nbtree-resurrect-the-recovery-side-pin-scan-for-V.patch
