On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:
The cause is that CheckRecoveryConsistency() is called before rm_redo(),
as Horiguchi-san suggested upthead. Imagine the case where
minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
record. When restarting the server with that minRecoveryPoint,
the followings would happen, and then PANIC occurs.

1. XLOG_SMGR_TRUNCATE record is read.
2. CheckRecoveryConsistency() is called, and database is marked as
     consistent since we've reached minRecoveryPoint (i.e., the location
     of XLOG_SMGR_TRUNCATE).
3. XLOG_SMGR_TRUNCATE record is replayed, and invalid page is
     found.
4. Since the database has already been marked as consistent, an invalid
     page leads to PANIC.

Exactly.

In smgr_redo, EndRecPtr which is pointing the record next to
SMGR_TRUNCATE, is used as the new minRecoveryPoint. On the other
hand, during the second startup of the standby,
CheckRecoveryConsistency checks for consistency by
XLByteLE(minRecoveryPoint, EndRecPtr) which should be true at
just BEFORE the SMGR_TRUNCATE record is applied. So
reachedConsistency becomes true just before the SMGR_TRUNCATE
record will be applied. Bang!

Ah, I see. I thought moving CheckRecoveryConsistency was just a nice-to-have thing, so that we'd notice that we're consistent earlier, so didn't pay attention to that part.

I think the real issue here is that CheckRecoveryConsistency should not be comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as it currently does. EndRecPtr points to the end of the last record *read*, while recoveryLastRecPtr points to the end of the last record *replayed*. The latter is what CheckRecoveryConsistency should be concerned with.

BTW, it occurs to me that we have two variables in the shared struct that are almost the same thing: recoveryLastRecPtr and replayEndRecPtr. The former points to the end of the last record successfully replayed, while replayEndRecPtr points to the end of the last record successfully replayed, or begin replayed at the moment. I find that confusing, so I suggest that we rename recoveryLastRecPtr to make that more clear. Included in the attached patch.

- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9bd7f03..6961a4a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -445,11 +445,15 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastCheckPointRecPtr;
 	CheckPoint	lastCheckPoint;
 
-	/* end+1 of the last record replayed (or being replayed) */
+	/*
+	 * lastReplayedEndRecPtr points to the end+1 of the last record
+	 * successfully replayed. When we're currently replaying a record, ie.
+	 * in a redo function,  replayEndRecPtr points to the end+1 of the record
+	 * being replayed, otherwise it's equal to lastReplayedEndRecPtr.
+	 */
+	XLogRecPtr	lastReplayedEndRecPtr;
 	XLogRecPtr	replayEndRecPtr;
 	TimeLineID	replayEndTLI;
-	/* end+1 of the last record replayed */
-	XLogRecPtr	recoveryLastRecPtr;
 	/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
 	TimestampTz recoveryLastXTime;
 	/* current effective recovery target timeline */
@@ -5759,7 +5763,7 @@ StartupXLOG(void)
 		SpinLockAcquire(&xlogctl->info_lck);
 		xlogctl->replayEndRecPtr = ReadRecPtr;
 		xlogctl->replayEndTLI = ThisTimeLineID;
-		xlogctl->recoveryLastRecPtr = EndRecPtr;
+		xlogctl->lastReplayedEndRecPtr = EndRecPtr;
 		xlogctl->recoveryLastXTime = 0;
 		xlogctl->currentChunkStartTime = 0;
 		xlogctl->recoveryPause = false;
@@ -5983,11 +5987,11 @@ StartupXLOG(void)
 				}
 
 				/*
-				 * Update shared recoveryLastRecPtr after this record has been
-				 * replayed.
+				 * Update lastReplayedEndRecPtr after this record has been
+				 * successfully replayed.
 				 */
 				SpinLockAcquire(&xlogctl->info_lck);
-				xlogctl->recoveryLastRecPtr = EndRecPtr;
+				xlogctl->lastReplayedEndRecPtr = EndRecPtr;
 				SpinLockRelease(&xlogctl->info_lck);
 
 				/* Remember this record as the last-applied one */
@@ -6383,10 +6387,11 @@ CheckRecoveryConsistency(void)
 	 * Have we passed our safe starting point? Note that minRecoveryPoint
 	 * is known to be incorrectly set if ControlFile->backupEndRequired,
 	 * until the XLOG_BACKUP_RECORD arrives to advise us of the correct
-	 * minRecoveryPoint. All we prior to that is its not consistent yet.
+	 * minRecoveryPoint. All we know prior to that is that we're not
+	 * consistent yet.
 	 */
 	if (!reachedConsistency && !ControlFile->backupEndRequired &&
-		XLByteLE(minRecoveryPoint, EndRecPtr) &&
+		XLByteLE(minRecoveryPoint, XLogCtl->lastReplayedEndRecPtr) &&
 		XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
 	{
 		/*
@@ -6398,7 +6403,8 @@ CheckRecoveryConsistency(void)
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
-						(uint32) (EndRecPtr >> 32), (uint32) EndRecPtr)));
+						(uint32) (XLogCtl->lastReplayedEndRecPtr >> 32),
+						(uint32) XLogCtl->lastReplayedEndRecPtr)));
 	}
 
 	/*
@@ -9094,7 +9100,7 @@ GetXLogReplayRecPtr(TimeLineID *targetTLI)
 	XLogRecPtr	recptr;
 
 	SpinLockAcquire(&xlogctl->info_lck);
-	recptr = xlogctl->recoveryLastRecPtr;
+	recptr = xlogctl->lastReplayedEndRecPtr;
 	if (targetTLI)
 		*targetTLI = xlogctl->RecoveryTargetTLI;
 	SpinLockRelease(&xlogctl->info_lck);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to