Re: [BUGS] PITR potentially broken in 9.2

Tom Lane Tue, 04 Dec 2012 18:27:46 -0800

Andres Freund <[email protected]> writes:
>> But the key is, the database was not actually consistent at that
>> point, and so opening hot standby was a dangerous thing to do.
>> 
>> The bug that allowed the database to open early (the original topic if
>> this email chain) was masking this secondary issue.


> Could you check whether the attached patch fixes the behaviour?

Yeah, I had just come to the same conclusion: the bug is not with
Heikki's fix, but with the pause logic.  The comment says that it
shouldn't pause unless users can connect to un-pause, but the actual
implementation of that test is several bricks shy of a load.

Your patch is better, but it's still missing a bet: what we need to be
sure of is not merely that we *could* have told the postmaster to start
hot standby, but that we *actually have done so*.  Otherwise, it's
flow-of-control-dependent whether it works or not; we could fail if the
main loop hasn't gotten to CheckRecoveryConsistency since the conditions
became true.  Looking at the code in CheckRecoveryConsistency, testing
LocalHotStandbyActive seems appropriate.

I also thought it was pretty dangerous that this absolutely critical
test was not placed in recoveryPausesHere() itself, rather than relying
on the call sites to remember to do it.

So the upshot is that I propose a patch more like the attached.

This is not a regression because the pause logic is broken this same
way since 9.1.  So I no longer think that we need a rewrap.

                        regards, tom lane

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 50e2b22..593dcee 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** recoveryStopsHere(XLogRecord *record, bo
*** 5923,5928 ****
--- 5923,5932 ----
  static void
  recoveryPausesHere(void)
  {
+ 	/* Don't pause unless users can connect! */
+ 	if (!LocalHotStandbyActive)
+ 		return;
+ 
  	ereport(LOG,
  			(errmsg("recovery has paused"),
  			 errhint("Execute pg_xlog_replay_resume() to continue.")));
*************** StartupXLOG(void)
*** 6697,6707 ****
  				 */
  				if (recoveryStopsHere(record, &recoveryApply))
  				{
! 					/*
! 					 * Pause only if users can connect to send a resume
! 					 * message
! 					 */
! 					if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
  					{
  						SetRecoveryPause(true);
  						recoveryPausesHere();
--- 6701,6707 ----
  				 */
  				if (recoveryStopsHere(record, &recoveryApply))
  				{
! 					if (recoveryPauseAtTarget)
  					{
  						SetRecoveryPause(true);
  						recoveryPausesHere();
*************** StartupXLOG(void)
*** 6737,6752 ****
  				/*
  				 * Update shared replayEndRecPtr before replaying this record,
  				 * so that XLogFlush will update minRecoveryPoint correctly.
  				 */
  				SpinLockAcquire(&xlogctl->info_lck);
  				xlogctl->replayEndRecPtr = EndRecPtr;
  				recoveryPause = xlogctl->recoveryPause;
  				SpinLockRelease(&xlogctl->info_lck);
  
! 				/*
! 				 * Pause only if users can connect to send a resume message
! 				 */
! 				if (recoveryPause && standbyState == STANDBY_SNAPSHOT_READY)
  					recoveryPausesHere();
  
  				/*
--- 6737,6751 ----
  				/*
  				 * Update shared replayEndRecPtr before replaying this record,
  				 * so that XLogFlush will update minRecoveryPoint correctly.
+ 				 *
+ 				 * While we have the lock, also check for a pause request.
  				 */
  				SpinLockAcquire(&xlogctl->info_lck);
  				xlogctl->replayEndRecPtr = EndRecPtr;
  				recoveryPause = xlogctl->recoveryPause;
  				SpinLockRelease(&xlogctl->info_lck);
  
! 				if (recoveryPause)
  					recoveryPausesHere();
  
  				/*

-- 
Sent via pgsql-bugs mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] PITR potentially broken in 9.2

Reply via email to