Hi
I believe I have discovered the following problem in pgsql 8.2 and HEAD,
concerning warm standbys that use WAL log shipping.
The problem is that after a crash, the master may complete incomplete
actions via rm_cleanup() - but since it does not WAL-log those changes,
the slave never learns about them. At the very least, this prevents the
creation of any further restart points on the slave (because
safe_restartpoint will never return true again) - it might even cause
data corruption, if the slave interprets subsequent WAL records wrongly
because it sees different data than the master did when it generated them.
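For readers not familiar with the resource-manager machinery, here is a
rough sketch of the callback table involved. The exact field list is my
assumption based on this era of the code, so treat it as illustrative only:

    /* Illustrative sketch (assumed layout) of a resource manager's
     * callback table. During recovery, rm_cleanup() finishes actions
     * (e.g. a half-completed btree page split) that were still
     * incomplete at the end of the WAL, and rm_safe_restartpoint()
     * reports whether a resource manager still holds partial
     * multi-record state.
     */
    typedef struct RmgrData
    {
        const char *rm_name;
        void        (*rm_redo) (XLogRecPtr lsn, XLogRecord *record);
        void        (*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
        void        (*rm_startup) (void);
        void        (*rm_cleanup) (void);           /* finish incomplete actions */
        bool        (*rm_safe_restartpoint) (void); /* no partial state left? */
    } RmgrData;

    extern const RmgrData RmgrTable[RM_MAX_ID + 1];

If the master runs rm_cleanup() at the end of its own crash recovery
without emitting WAL for the completed actions, the slave's copy of that
partial state is never cleared, so its rm_safe_restartpoint() checks keep
returning false.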
Attached is a patch that makes RecoveryRestartPoint call all
rm_cleanup() methods and create a restart point whenever it encounters
a shutdown checkpoint in the WAL (because those are generated after
recovery). This should not cause any performance degradation, because
shutdown checkpoints occur very infrequently.
The patch is per discussion with Simon Riggs.
I've not yet had a chance to test this patch; I've only made sure
that it compiles. I'm sending it out now because I hope it
might make it into 8.2.4.
greetings, Florian Pflug
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
* Perform a checkpoint to update all our recovery activity to disk.
*
* Note that we write a shutdown checkpoint rather than an on-line
- * one. This is not particularly critical, but since we may be
- * assigning a new TLI, using a shutdown checkpoint allows us to have
- * the rule that TLI only changes in shutdown checkpoints, which
- * allows some extra error checking in xlog_redo.
+ * one. A slave will always create a restart point if it sees a
+ * shutdown checkpoint, and will call all rm_cleanup() methods before
+ * it does so. This guarantees that any actions taken by the master
+ * in rm_cleanup will also be carried out on the slave.
+ * Additionally, we may be assigning a new TLI, so using a shutdown
+ * checkpoint allows us to have the rule that TLI only changes in shutdown
+ * checkpoints, which allows some extra error checking in xlog_redo.
*/
CreateCheckPoint(true, true);
@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
* restartpoint is needed or not.
*/
static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
{
int elapsed_secs;
int rmid;
/*
- * Do nothing if the elapsed time since the last restartpoint is less than
- * half of checkpoint_timeout. (We use a value less than
- * checkpoint_timeout so that variations in the timing of checkpoints on
- * the master, or speed of transmission of WAL segments to a slave, won't
- * make the slave skip a restartpoint once it's synced with the master.)
- * Checking true elapsed time keeps us from doing restartpoints too often
- * while rapidly scanning large amounts of WAL.
+ * If the checkpoint we saw in the WAL was a shutdown checkpoint, it might
+ * have been written after recovery following a crash of the master.
+ * In that case, the master will have completed any actions that were
+ * incomplete when it crashed *during recovery*, and these completions
+ * are therefore *not* logged in the WAL.
+ * To prevent getting out of sync, we follow what the master did, and
+ * call the rm_cleanup() methods. To be on the safe side, we then create
+ * a restart point, regardless of the time elapsed. Note that asking
+ * the resource managers if they have partial state would be redundant
+ * after calling rm_cleanup().
*/
- elapsed_secs = time(NULL) - ControlFile->time;
- if (elapsed_secs < CheckPointTimeout / 2)
- return;
+ if (shutdownCheckpoint) {
+ for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ {
+ if (RmgrTable[rmid].rm_cleanup != NULL)
+ RmgrTable[rmid].rm_cleanup();
+ }
+ }
+ else {
+ /*
+ * Do nothing if the elapsed time since the last restartpoint is less than
+ * half of checkpoint_timeout. (We use a value less than
+ * checkpoint_timeout so that variations in the timing of checkpoints on
+ * the master, or speed of transmission of WAL segments to a slave, won't
+ * make the slave skip a restartpoint once it's synced with the master.)
+ * Checking true elapsed time keeps us from doing restartpoints too often
+ * while rapidly scanning large amounts of WAL.
+ */
+ elapsed_secs = time(NULL) - ControlFile->time;
+ if (elapsed_secs < CheckPointTimeout / 2)
+ return;
- /*
- * Is it safe to checkpoint? We must ask each of the resource managers
- * whether they have any partial state information that might prevent a
- * correct restart from this point. If so, we skip this opportunity, but
- * return at the next checkpoint record for another try.
- */
- for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
- {
- if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
- if (!(RmgrTable[rmid].rm_safe_restartpoint()))
- return;
+ /*
+ * Is it safe to checkpoint? We must ask each of the resource managers
+ * whether they have any partial state information that might prevent a
+ * correct restart from this point. If so, we skip this opportunity, but
+ * return at the next checkpoint record for another try.
+ */
+ for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ {
+ if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
+ if (!(RmgrTable[rmid].rm_safe_restartpoint()))
+ return;
+ }
}
/*
@@ -5835,7 +5859,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
ThisTimeLineID = checkPoint.ThisTimeLineID;
}
- RecoveryRestartPoint(&checkPoint);
+ RecoveryRestartPoint(&checkPoint, true);
}
else if (info == XLOG_CHECKPOINT_ONLINE)
{
@@ -5864,7 +5888,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
checkPoint.ThisTimeLineID, ThisTimeLineID)));
- RecoveryRestartPoint(&checkPoint);
+ RecoveryRestartPoint(&checkPoint, false);
}
else if (info == XLOG_SWITCH)
{