Hi Ayush, On Wed, May 6, 2026 at 3:18 PM Ayush Tiwari <[email protected]> wrote: > > Hi, > > I found a backend crash in WAIT FOR LSN when it is interrupted inside a > savepoint and the session then waits again. > > I tried to find if it was already reported, but could not find it, so, > posting it. > > While navigating I noticed WAIT FOR LSN cleanup is incomplete on > subtransaction abort. An interrupt such as statement_timeout while > waiting inside a savepoint leaves stale per-backend wait state, > causing a later WAIT FOR LSN in the same backend to violate > the wait-heap invariant and crash an assertion-enabled build. > > A small reproducer is: > > BEGIN; > SAVEPOINT s; > SET statement_timeout = '100ms'; > WAIT FOR LSN '<future-lsn>' WITH (MODE 'primary_flush'); > ROLLBACK TO s; > SET statement_timeout = 0; > WAIT FOR LSN '0/0' WITH (MODE 'primary_flush', TIMEOUT '10ms', NO_THROW); > COMMIT; > > where <future-lsn> can be generated with: > > SELECT pg_current_wal_insert_lsn() + 10000000000; > > TRAP: failed Assert("!procInfo->inHeap"), File: "xlogwait.c" > > The attached patch mirrors the top-level abort cleanup by calling > WaitLSNCleanup() from AbortSubTransaction(), after LWLockReleaseAll(). It > also adds a TAP test to verify that WAIT FOR LSN can be reused in the same > backend after a statement_timeout and ROLLBACK TO SAVEPOINT. > > Thoughts?
Thanks for reporting this. I agree with your analysis. We need to add this clean-up into AbortSubTransaction. I've some comments on the patch: 1) Update the comment of WaitLSNCleanup Now the comment of this function says, "Clean up LSN waiters for exiting process." After this patch, this description would be too narrow because after a ROLLBACK TO SAVEPOINT, the same backend continues running and may issue another WAIT FOR LSN. How about changing it to something like: /* * Clean up any LSN wait state for the current process. */ 2) How about turning the SET statement_timeout = '100ms'; into more deterministic machinery used in other sections of the test file to ensure that the registration actually occurs by starting the waiter in a background psql session and waits until itself reports: wait_event = 'WaitForWalFlush' for that backend. 3) For the validation, consider replacing the fast success of '0/0' with the timeout of the unreachable LSN used earlier. This would reduce assumptions about the order between registration and the fast check being fixed. -- Best, Xuneng
v2-0001-Fix-WAIT-FOR-LSN-cleanup-on-subtransaction-abort.patch
Description: Binary data
