Hi Hackers, This is a follow-up to Matt Blewitt's report and patch from February [1], which identified the following bug: when the first XLOG_RUNNING_XACTS record a standby replays has subxid_overflow set, standbyState gets stuck at STANDBY_SNAPSHOT_PENDING, and hot standby is never activated. As a consequence, recovery_target_action = 'pause' is silently ignored: recoveryPausesHere() returns immediately because !LocalHotStandbyActive, the PAUSE case falls through, and the server promotes instead of pausing.
We'd like to propose an alternative fix for the same problem and describe why
we believe serving read-only queries in this state is safe, and why we
deliberately do not advance standbyState to STANDBY_SNAPSHOT_READY as the
earlier patch did.
Patches
=======
0001 - Behavior-preserving refactor: pull the connection-enabling block
out of CheckRecoveryConsistency() into a small helper.
0002 - The fix: call EnableHotStandbyConnections() from the
RECOVERY_TARGET_ACTION_PAUSE path, just before recoveryPausesHere(),
and add a TAP test.
Why we believe enabling reads is correct
=======
The reason a standby normally refuses queries from an overflowed snapshot is
the risk of an incorrect visibility decision for a subtransaction whose
top-level transaction is still running on the primary.
When the initial RUNNING_XACTS is overflowed, KnownAssignedXids may be missing
some subxids. For such a recovery snapshot, XidInMVCCSnapshot() first maps the
xid to its topmost parent via SubTransGetTopmostTransaction() and then looks
it up in the in-progress set. If that mapping is not available in pg_subtrans
and the xid is not present in KnownAssignedXids, the xid is treated as "not in
the snapshot", and the final committed/aborted decision is delegated to
TransactionIdDidCommit() in HeapTupleSatisfiesMVCC().
During active WAL replay, this is the dangerous case: a subxid S of a
still-running top transaction T may have its row present on disk; if S cannot
be resolved as in-progress and T's commit record is later replayed, CLOG flips
T to committed, and a query could suddenly see a row that was not committed as
of its snapshot's xmax. That is exactly why connections are withheld until a
non-overflowed snapshot (STANDBY_SNAPSHOT_READY) gives complete knowledge.
At an end-of-recovery pause, this hazard disappears because replay is frozen.
The only ways out of recoveryPausesHere(true) are promotion and shutdown. A
pg_wal_replay_resume() at the end of recovery falls through to promotion rather
than resuming replay.
Therefore, no commit record for any in-progress transaction will ever be
replayed, so CLOG cannot transition T (or its subxids) to committed after this
point. TransactionIdDidCommit() for such an xid stays false forever. So, the
MVCC visibility fallback keeps the row invisible.
In short, the set of transactions a query can observe as committed is now
stable, and is exactly the set that committed before replay stopped. The single
condition that makes overflowed-snapshot reads unsafe during live replay (a
transaction that was in progress as of a snapshot later being observed as
committed) cannot arise once replay halts. So the pending snapshot, while still
overflowed, yields correct and stable visibility.
Why we keep standbyState at STANDBY_SNAPSHOT_PENDING
=======
The earlier patch forced standbyState to STANDBY_SNAPSHOT_READY at the pause
point and re-ran CheckRecoveryConsistency(). We chose not to do that.
STANDBY_SNAPSHOT_READY means we have full knowledge of the transactions that
were running on the primary and that snapshots are complete and need not be
treated as overflowed. That is not true here since the snapshot is still
overflowed (visibility stays correct for the frozen-replay reason above, not
because the snapshot has become complete). Forcing the state to READY would
assert something false about the recovery state.
Original report and first patch by Matt Blewitt. Thanks also for the analysis
in that thread.
Thoughts welcome.
[1]
https://www.postgresql.org/message-id/CACy-Nv24ZORVN9_S_yHF5Nsip45HKCBtKVNC3XdKgz%2B1wvGvEQ%40mail.gmail.com
Best Regards
Jan Nidzwetzki
On behalf of PlanetScale
0001-Refactor-extract-EnableHotStandbyConnections-helper.patch
Description: Binary data
0002-Honor-recovery_target_action-pause-on-inconsistent-s.patch
Description: Binary data
