Hi folks,

We observed a case where our backup tooling was periodically failing for a specific workload: nested subtransactions overflowing the subxid cache. We don't have visibility into the specific customer workload (e.g. whether it uses SAVEPOINT or EXCEPTION handling), but the attached TAP test reproduces the problem.
The problem and the proposed fix are described below. Happy to discuss further.

Problem: When the first XLOG_RUNNING_XACTS record seen during recovery has subxid_overflow=true, the standby enters STANDBY_SNAPSHOT_PENDING and hot standby never activates (LocalHotStandbyActive stays false). This causes recovery_target_action = 'pause' to be silently bypassed: recoveryPausesHere() returns immediately when hot standby is not yet active, so the pause is skipped and the server promotes instead.

Fix: In PerformWalRecovery(), when the recovery target is reached and the snapshot is still PENDING, force a transition to STANDBY_SNAPSHOT_READY and call CheckRecoveryConsistency() to activate hot standby before the target action switch is evaluated.

As I understand it, this is safe because subtransaction commits write to CLOG but produce no WAL entry, so standbys always see overflowed subxids as INPROGRESS rather than SUB_COMMITTED. INPROGRESS subxids are invisible without any SubTrans lookup, so the missing SubTrans entries that STANDBY_SNAPSHOT_PENDING guards against cannot cause incorrect visibility results.

The patch adds a TAP test (052_pitr_subxid_overflow.pl) that exercises the scenario: the overflow transaction is kept open during the base backup's forced checkpoint, so that the very first XLOG_RUNNING_XACTS record the standby replays has subxid_overflow=true. A named restore point is then created while the overflow transaction is still open. Without the fix, the standby promotes silently at the target; with the fix, it pauses and accepts hot-standby queries.

Note: Subtransaction XIDs are only assigned when the subtransaction writes, so gen_subxids() must perform an INSERT at each recursion level to force the PGPROC subxid cache to overflow.

I would consider this for backporting to supported releases.
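
To make the control flow described above easier to follow, here is a toy C model of the pause decision. It is only a sketch and not PostgreSQL source: the Toy* types and toy_* functions are hypothetical stand-ins for the real state (STANDBY_SNAPSHOT_PENDING / STANDBY_SNAPSHOT_READY, LocalHotStandbyActive) and for CheckRecoveryConsistency(), and the model assumes recovery_target_action = 'pause' with the first running-xacts record having overflowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the standby snapshot states named in the post. */
typedef enum
{
    TOY_SNAPSHOT_PENDING,   /* first running-xacts record had subxid_overflow=true */
    TOY_SNAPSHOT_READY      /* snapshot usable, hot standby may activate */
} ToySnapshotState;

typedef enum
{
    TOY_OUTCOME_PAUSED,     /* recovery paused at the target, as requested */
    TOY_OUTCOME_PROMOTED    /* pause skipped, server promoted instead */
} ToyOutcome;

typedef struct
{
    ToySnapshotState snapshot_state;
    bool             hot_standby_active;   /* models LocalHotStandbyActive */
} ToyRecovery;

/* Models CheckRecoveryConsistency(): activate hot standby once the snapshot is ready. */
static void
toy_check_consistency(ToyRecovery *r)
{
    assert(r != NULL);
    if (r->snapshot_state == TOY_SNAPSHOT_READY)
        r->hot_standby_active = true;
}

/*
 * Models reaching the recovery target with recovery_target_action = 'pause'.
 * With apply_fix set, a still-PENDING snapshot is forced to READY first,
 * which is the transition the proposed patch performs.
 */
static ToyOutcome
toy_reach_target(ToyRecovery *r, bool apply_fix)
{
    if (apply_fix && r->snapshot_state == TOY_SNAPSHOT_PENDING)
    {
        r->snapshot_state = TOY_SNAPSHOT_READY;
        toy_check_consistency(r);
    }

    /* The pause request returns immediately unless hot standby is active. */
    return r->hot_standby_active ? TOY_OUTCOME_PAUSED : TOY_OUTCOME_PROMOTED;
}
```

In this model, calling toy_reach_target() on a PENDING state with apply_fix = false yields a promotion, while apply_fix = true activates hot standby and yields a pause, mirroring the behavior the TAP test checks.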
0001-Fix-PITR-pause-bypass-when-initial-XLOG_RUNNING_XACT.patch
