Andres Freund <and...@2ndquadrant.com> writes: >> But the key is, the database was not actually consistent at that >> point, and so opening hot standby was a dangerous thing to do. >> >> The bug that allowed the database to open early (the original topic if >> this email chain) was masking this secondary issue.
> Could you check whether the attached patch fixes the behaviour? Yeah, I had just come to the same conclusion: the bug is not with Heikki's fix, but with the pause logic. The comment says that it shouldn't pause unless users can connect to un-pause, but the actual implementation of that test is several bricks shy of a load. Your patch is better, but it's still missing a bet: what we need to be sure of is not merely that we *could* have told the postmaster to start hot standby, but that we *actually have done so*. Otherwise, it's flow-of-control-dependent whether it works or not; we could fail if the main loop hasn't gotten to CheckRecoveryConsistency since the conditions became true. Looking at the code in CheckRecoveryConsistency, testing LocalHotStandbyActive seems appropriate. I also thought it was pretty dangerous that this absolutely critical test was not placed in recoveryPausesHere() itself, rather than relying on the call sites to remember to do it. So the upshot is that I propose a patch more like the attached. This is not a regression because the pause logic is broken this same way since 9.1. So I no longer think that we need a rewrap. regards, tom lane
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 50e2b22..593dcee 100644 *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *************** recoveryStopsHere(XLogRecord *record, bo *** 5923,5928 **** --- 5923,5932 ---- static void recoveryPausesHere(void) { + /* Don't pause unless users can connect! */ + if (!LocalHotStandbyActive) + return; + ereport(LOG, (errmsg("recovery has paused"), errhint("Execute pg_xlog_replay_resume() to continue."))); *************** StartupXLOG(void) *** 6697,6707 **** */ if (recoveryStopsHere(record, &recoveryApply)) { ! /* ! * Pause only if users can connect to send a resume ! * message ! */ ! if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY) { SetRecoveryPause(true); recoveryPausesHere(); --- 6701,6707 ---- */ if (recoveryStopsHere(record, &recoveryApply)) { ! if (recoveryPauseAtTarget) { SetRecoveryPause(true); recoveryPausesHere(); *************** StartupXLOG(void) *** 6737,6752 **** /* * Update shared replayEndRecPtr before replaying this record, * so that XLogFlush will update minRecoveryPoint correctly. */ SpinLockAcquire(&xlogctl->info_lck); xlogctl->replayEndRecPtr = EndRecPtr; recoveryPause = xlogctl->recoveryPause; SpinLockRelease(&xlogctl->info_lck); ! /* ! * Pause only if users can connect to send a resume message ! */ ! if (recoveryPause && standbyState == STANDBY_SNAPSHOT_READY) recoveryPausesHere(); /* --- 6737,6751 ---- /* * Update shared replayEndRecPtr before replaying this record, * so that XLogFlush will update minRecoveryPoint correctly. + * + * While we have the lock, also check for a pause request. */ SpinLockAcquire(&xlogctl->info_lck); xlogctl->replayEndRecPtr = EndRecPtr; recoveryPause = xlogctl->recoveryPause; SpinLockRelease(&xlogctl->info_lck); ! if (recoveryPause) recoveryPausesHere(); /*
-- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs