Hi Alexander, Hackers,

GetCurrentLSNForWaitType() for WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH previously relied on the WAL receiver's tracked write/flush positions (GetWalRcvWriteRecPtr/GetWalRcvFlushRecPtr). As a result, WAIT FOR LSN queries can stall even though replay is making progress. I have split this into two scenarios to make the setups clear, but the underlying problem is the same.

(1) When the standby is disconnected from the primary and has switched to reading WAL from the archive, it stays in that mode until no more WAL is available to replay, and only then switches to streaming. Until then, WAIT FOR LSN calls get stuck on the standby even though replay has advanced beyond the stale WAL receiver position. Switching the XLog source from archive to streaming is tracked separately in [1].

(2) In archive recovery, no WAL receiver process exists, so these functions return InvalidXLogRecPtr (0/0). WAIT FOR LSN with the standby_flush or standby_write mode would always time out, even for WAL that has been fully replayed.

Fix by falling back to the replay LSN (GetXLogReplayRecPtr) when the WAL receiver position is invalid or behind replay. This is correct because any WAL that has been replayed has necessarily already been written and flushed to disk.

A TAP test that reproduces the problem is attached.

[1]: https://www.postgresql.org/message-id/cahg+qddlmfps0n0u3u+e+dw7x7jjeosjj0alesrtxs-tuyf...@mail.gmail.com

Thanks,
Satya
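
For illustration only, the fallback rule described above can be sketched as a small standalone function. This is not the code from the attached patch: the helper name effective_standby_lsn and its parameters are hypothetical, and XLogRecPtr / InvalidXLogRecPtr are redefined locally as stand-ins so the snippet compiles outside the PostgreSQL tree.

```c
#include <stdint.h>

/* Local stand-ins for PostgreSQL's definitions (assumed here so the sketch is self-contained). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: choose the LSN that standby_write/standby_flush
 * waiters are compared against.
 *
 * receiver_lsn: position reported by the WAL receiver; InvalidXLogRecPtr
 *               when no WAL receiver exists (archive recovery).
 * replay_lsn:   current replay position.
 */
static XLogRecPtr
effective_standby_lsn(XLogRecPtr receiver_lsn, XLogRecPtr replay_lsn)
{
    /* Replayed WAL is already on disk, so replay is a safe lower bound. */
    if (receiver_lsn == InvalidXLogRecPtr || receiver_lsn < replay_lsn)
        return replay_lsn;

    /* Receiver is ahead of replay: keep the receiver's position. */
    return receiver_lsn;
}
```

With this rule, scenario (1) returns the replay position whenever the stale receiver position lags behind it, and scenario (2) returns the replay position because the receiver position is invalid.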
0001-Fix-WAIT-FOR-LSN-standby_write-standby_flush-for-arc.patch
0001-Add-TAP-test-for-WAIT-FOR-LSN-during-archive-recover.patch
