Hi
I believe I have discovered the following problem in pgsql 8.2 and HEAD,
concerning warm standbys that use WAL log shipping.
The problem is that after a crash, the master may complete incomplete
actions via rm_cleanup() - but since it does not WAL-log those changes,
the slave never learns about them. At the very least, this prevents the
creation of any further restart points on the slave (because
safe_restartpoint will never return true again) - it might even cause
data corruption, if the slave interprets subsequent WAL records wrongly
because it sees different data than the master did when it generated them.
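For readers not familiar with the resource-manager machinery, here is a
rough sketch of the callback table involved. The exact field list is my
assumption based on this era of the code, so treat it as illustrative only:

    /* Illustrative sketch (assumed layout) of a resource manager's
     * callback table. During recovery, rm_cleanup() finishes actions
     * (e.g. a half-completed btree page split) that were still
     * incomplete at the end of the WAL, and rm_safe_restartpoint()
     * reports whether a resource manager still holds partial
     * multi-record state.
     */
    typedef struct RmgrData
    {
        const char *rm_name;
        void        (*rm_redo) (XLogRecPtr lsn, XLogRecord *record);
        void        (*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
        void        (*rm_startup) (void);
        void        (*rm_cleanup) (void);           /* finish incomplete actions */
        bool        (*rm_safe_restartpoint) (void); /* no partial state left? */
    } RmgrData;

    extern const RmgrData RmgrTable[RM_MAX_ID + 1];

If the master runs rm_cleanup() at the end of its own crash recovery
without emitting WAL for the completed actions, the slave's copy of that
partial state is never cleared, so its rm_safe_restartpoint() checks keep
returning false.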
Attached is a patch that makes RecoveryRestartPoint call all
rm_cleanup() methods and create a restart point whenever it encounters
a shutdown checkpoint in the WAL (because those are generated after
recovery). This should not cause any performance degradation, because
shutdown checkpoints occur very infrequently.
The patch is per discussion with Simon Riggs.
I've not yet had a chance to test this patch; I've only made sure
that it compiles. I'm sending it out now because I hope it
might make it into 8.2.4.
greetings, Florian Pflug
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
* Perform a checkpoint to update all our recovery activity to disk.
*
* Note that we write a shutdown checkpoint rather than an on-line
- * one. This is not particularly critical, but since we may be
- * assigning a new TLI, using a shutdown checkpoint allows us to have
- * the rule that TLI only changes in shutdown checkpoints, which
- * allows some extra error checking in xlog_redo.
+ * one. A slave will always create a restart point if it sees a
+ * shutdown checkpoint, and will call all rm_cleanup() methods before
+ * it does so. This guarantees that any actions taken by the master
+ * in rm_cleanup will also be carried out on the slave.
+ * Additionally, we may be assigning a new TLI, so using a shutdown
+ * checkpoint allows us to have the rule that TLI only changes in shutdown
+ * checkpoints, which allows some extra error checking in xlog_redo.
*/
CreateCheckPoint(true, true);
@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
* restartpoint is needed or not.
*/
static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
{
int elapsed_secs;
int rmid;
/*
- * Do nothing if the elapsed time since the last restartpoint is less than
- * half of checkpoint_timeout. (We use a value less than
- * checkpoint_timeout so that variations in the timing of checkpoints on
- * the master, or speed of transmission of WAL segments to a slave, won't
- * make the slave skip a restartpoint once it's synced with the master.)
- * Checking true elapsed time keeps us from doing restartpoints too often
- * while rapidly scanning large amounts of WAL.
+ * If the checkpoint we saw in the WAL was a shutdown checkpoint, it might
+ * have been written after recovery following a crash of the master.
+ * In that case, the master will have completed any actions that were
+ * incomplete when it crashed *during recovery*, and these completions
+ * are therefore *not* logged in the WAL.
+ * To prevent getting out of sync, we follow what the master did, and
+ * call the rm_cleanup() methods. To be on the safe side, we then create
+ * a restart point, regardless of the time elapsed. Note that asking
+ * the resource managers if they have partial state would be redundant
+ * after calling rm_cleanup().
*/
- elapsed_secs = time(NULL) - ControlFile->time;
- if (elapsed_secs < CheckPointTimeout / 2)
- return;
+ if (shutdownCheckpoint) {
+ for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ {
+ if (RmgrTable[rmid].rm_cleanup != NULL)
+ RmgrTable[rmid].rm_cleanup();
+ }
+ }
+ else {
+ /*
+ * Do nothing if the elapsed time since the last restartpoint is less than
+ * half of checkpoint_timeout. (We use a value less than
+ * checkpoint_timeout so that variations in the timing of checkpoints on
+ * the master, or speed of transmission of WAL segments to a slave, won't
+ * make the slave skip a restartpoint once it's synced with the master.)
+ * Checking true elapsed time keeps us from doing restartpoints too often
+ * while rapidly scanning large amounts of WAL.
+ */
+ elapsed_secs = time(NULL) - ControlFile->time;
+ if (elapsed_secs < CheckPointTimeout / 2)
+ return;
- /*
- * Is it safe to checkpoint? We must ask each of the resource managers
- * whether they have any partial state information that might prevent a
- * correct restart from this point. If so, we skip this opportunity, but
- * return at the next checkpoint record for another try.
- */
- for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
- {
- if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
- if (!(RmgrTable[rmid].rm_safe_restartpoint()))
- return;
+ /*
+ * Is it safe to checkpoint? We must ask each of the resource managers
+ * whether they have any partial state information that might prevent a
+ * correct restart from this point. If so, we skip this opportunity, but
+ * return at the next checkpoint record for another try.
+ */
+ for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ {
+ if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
+ if (!(RmgrTable[rmid].rm_safe_restartpoint()))
+ return;
+ }
}
/*
@@ -5835,7 +5859,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
ThisTimeLineID = checkPoint.ThisTimeLineID;
}
- RecoveryRestartPoint(&checkPoint);
+ RecoveryRestartPoint(&checkPoint, true);
}
else if (info == XLOG_CHECKPOINT_ONLINE)
{
@@ -5864,7 +5888,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
checkPoint.ThisTimeLineID, ThisTimeLineID)));
- RecoveryRestartPoint(&checkPoint);
+ RecoveryRestartPoint(&checkPoint, false);
}
else if (info == XLOG_SWITCH)
{