Hi Hackers,

This is a follow-up to Matt Blewitt's report and patch from February [1], which 
identified the following bug: when the first XLOG_RUNNING_XACTS record a 
standby replays has subxid_overflow set, standbyState gets stuck at 
STANDBY_SNAPSHOT_PENDING, and hot standby is never activated. As a consequence, 
recovery_target_action = 'pause' is silently ignored: recoveryPausesHere() 
returns immediately because !LocalHotStandbyActive, the PAUSE case falls 
through, and the server promotes instead of pausing.

We'd like to propose an alternative fix for the same problem and describe why 
we believe serving read-only queries in this state is safe, and why we 
deliberately do not advance standbyState to STANDBY_SNAPSHOT_READY as the 
earlier patch did.


Patches
=======
 0001 - Behavior-preserving refactor: pull the connection-enabling block
        out of CheckRecoveryConsistency() into a small helper.

 0002 - The fix: call EnableHotStandbyConnections() from the
        RECOVERY_TARGET_ACTION_PAUSE path, just before recoveryPausesHere(), 
and add a TAP test.


Why we believe enabling reads is correct
=======

The reason a standby normally refuses queries from an overflowed snapshot is 
the risk of an incorrect visibility decision for a subtransaction whose 
top-level transaction is still running on the primary.

When the initial RUNNING_XACTS is overflowed, KnownAssignedXids may be missing 
some subxids. For such a recovery snapshot, XidInMVCCSnapshot() first maps the 
xid to its topmost parent via  SubTransGetTopmostTransaction() and then looks 
it up in the in-progress set. If that mapping is not available in pg_subtrans 
and the xid is not present in KnownAssignedXids, the xid is treated as "not in 
the snapshot", and the final committed/aborted decision is delegated to 
TransactionIdDidCommit() in HeapTupleSatisfiesMVCC().

During active WAL replay, this is the dangerous case: a subxid S of a 
still-running top transaction T may have its row present on disk; if S cannot 
be resolved as in-progress and T's commit record is later replayed, CLOG flips 
T to committed, and a query could suddenly see a row that was not committed as 
of its snapshot's xmax. That is exactly why connections are withheld until a 
non-overflowed snapshot (STANDBY_SNAPSHOT_READY) gives complete knowledge.

At an end-of-recovery pause, this hazard disappears because replay is frozen. 
The only ways out of recoveryPausesHere(true) are promotion and shutdown. A 
pg_wal_replay_resume() at the end of recovery falls through to promotion rather 
than resuming replay.

Therefore, no commit record for any in-progress transaction will ever be 
replayed, so CLOG cannot transition T (or its subxids) to committed after this 
point. TransactionIdDidCommit() for such an xid stays false forever. So, the 
MVCC visibility fallback keeps the row invisible.

In short, the set of transactions a query can observe as committed is now 
stable, and is exactly the set that committed before replay stopped. The single 
condition that makes overflowed-snapshot reads unsafe during live replay (a 
transaction that was in progress as of a snapshot later being observed as 
committed) cannot arise once replay halts. So the pending snapshot, while still 
overflowed, yields correct and stable visibility.


Why we keep standbyState at STANDBY_SNAPSHOT_PENDING
=======

The earlier patch forced standbyState to STANDBY_SNAPSHOT_READY at the pause 
point and re-ran CheckRecoveryConsistency(). We chose not to do that.

STANDBY_SNAPSHOT_READY means we have full knowledge of the transactions that 
were running on the primary and that snapshots are complete and need not be 
treated as overflowed. That is not true here since the snapshot is still 
overflowed (visibility stays correct for the frozen-replay reason above, not 
because the snapshot has become complete). Forcing the state to READY would 
assert something false about the recovery state.


Original report and first patch by Matt Blewitt. Thanks also for the analysis 
in that thread.

Thoughts welcome.

[1] 
https://www.postgresql.org/message-id/CACy-Nv24ZORVN9_S_yHF5Nsip45HKCBtKVNC3XdKgz%2B1wvGvEQ%40mail.gmail.com

Best Regards
Jan Nidzwetzki
On behalf of PlanetScale


Attachment: 0001-Refactor-extract-EnableHotStandbyConnections-helper.patch
Description: Binary data

Attachment: 0002-Honor-recovery_target_action-pause-on-inconsistent-s.patch
Description: Binary data

Reply via email to