Hi, On Wed, 6 May 2026 at 15:13, Alexander Korotkov <[email protected]> wrote:
> On Wed, May 6, 2026 at 11:46 AM Xuneng Zhou <[email protected]> wrote: > > > > On Wed, May 6, 2026 at 3:18 PM Ayush Tiwari <[email protected]> > wrote: > > > > > > Hi, > > > > > > I found a backend crash in WAIT FOR LSN when it is interrupted inside a > > > savepoint and the session then waits again. > > > > > > I tried to find if it was already reported, but could not find it, so, > posting it. > > > > > > While navigating I noticed WAIT FOR LSN cleanup is incomplete on > > > subtransaction abort. An interrupt such as statement_timeout while > > > waiting inside a savepoint leaves stale per-backend wait state, > > > causing a later WAIT FOR LSN in the same backend to violate > > > the wait-heap invariant and crash an assertion-enabled build. > > > > > > A small reproducer is: > > > > > > BEGIN; > > > SAVEPOINT s; > > > SET statement_timeout = '100ms'; > > > WAIT FOR LSN '<future-lsn>' WITH (MODE 'primary_flush'); > > > ROLLBACK TO s; > > > SET statement_timeout = 0; > > > WAIT FOR LSN '0/0' WITH (MODE 'primary_flush', TIMEOUT '10ms', > NO_THROW); > > > COMMIT; > > > > > > where <future-lsn> can be generated with: > > > > > > SELECT pg_current_wal_insert_lsn() + 10000000000; > > > > > > TRAP: failed Assert("!procInfo->inHeap"), File: "xlogwait.c" > > > > > > The attached patch mirrors the top-level abort cleanup by calling > > > WaitLSNCleanup() from AbortSubTransaction(), after > LWLockReleaseAll(). It > > > also adds a TAP test to verify that WAIT FOR LSN can be reused in the > same > > > backend after a statement_timeout and ROLLBACK TO SAVEPOINT. > > > > > > Thoughts? > > > > Thanks for reporting this. I agree with your analysis. > > +1, thank you for reporting this. > > > We need to add > > this clean-up into AbortSubTransaction. I've some comments on the > > patch: > > > > 1) Update the comment of WaitLSNCleanup > > > > Now the comment of this function says, "Clean up LSN waiters for > > exiting process." After this patch, this description would be too > > narrow because after a ROLLBACK TO SAVEPOINT, the same backend > > continues running and may issue another WAIT FOR LSN. How about > > changing it to something like: > > > > /* > > * Clean up any LSN wait state for the current process. > > */ > > I agree with this change. Under detailed consideration, "existing > process" sounds confusing in this context even if it's called just > from AbortTransaction(). > > > 2) How about turning the SET statement_timeout = '100ms'; into more > > deterministic machinery used in other sections of the test file to > > ensure that the registration actually occurs by starting the waiter in > > a background psql session and waits until itself reports: wait_event = > > 'WaitForWalFlush' for that backend. > > > > 3) For the validation, consider replacing the fast success of '0/0' > > with the timeout of the unreachable LSN used earlier. This would > > reduce assumptions about the order between registration and the fast > > check being fixed. > > I'm OK with these changes too, as they improve the test stability. > > I'm going to push this if no objections. > > Looks good to me. Thanks for the update and review! Regards, Ayush
