On Thu, Jan 11, 2018 at 10:35:22PM -0500, Stephen Frost wrote:
> Magnus, this was your thread to begin with, though I know others have
> been involved, any chance you'll be able to review this for commit
> during this CF?  I agree that this is certainly a good thing to have
> too, though I've not looked at the patch itself in depth.  Is there
> anything we can do to help move it along?

As an effort to move on with bug items in the commit fest, attached are
two patches with a proposed commit message as well as polished comments
Those are proposed for a back-patched.  The 2PC issue is particularly
bad in my opinion because having any 2PC file on-disk and corrupted
means that a transaction is lost.  I have been playing a bit with
hexedit and changed a couple of bytes in one of them...  If trying to
use a base backup which includes one, then the standby reading it would
be similarly confused.
--
Michael
From d12621c1e03abf8876d944ea3e831213111f2909 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 28 Jun 2018 14:38:24 +0900
Subject: [PATCH 1/2] Fail hard when facing corrupted two-phase state files

When a corrupted file is found by WAL replay, be it for crash recovery
or archive recovery, then the file is simply skipped and a WARNING is
logged to the user.  Facing an on-disk WAL file which is corrupted is
more likely to happen than its pair recorded in dedicated WAL records,
but if that happens then the instance faces data loss as the transaction
is not around anymore as it is not possible to commit it.

Reported-by: Michael Paquier
Author: Michael Paquier
Discussion: https://postgr.es/m/20161216060832.gb17...@paquier.xyz
---
 src/backend/access/transam/twophase.c | 28 ++++++++++-----------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a9ef1b3d73..2da3d93d87 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1873,6 +1873,10 @@ restoreTwoPhaseData(void)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and a new
+ * backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -2165,13 +2169,9 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		/* Read and validate file */
 		buf = ReadTwoPhaseFile(xid, true);
 		if (buf == NULL)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-			return NULL;
-		}
 	}
 	else
 	{
@@ -2184,21 +2184,13 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!TransactionIdEquals(hdr->xid, xid))
 	{
 		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
 		else
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state from memory for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state in memory for \"%u\"",
 							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		pfree(buf);
-		return NULL;
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1a419aa49b..3695258e6f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7462,6 +7462,13 @@ StartupXLOG(void)
 		}
 	}
 
+	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * This information is not quite needed yet, but it is positioned here so
+	 * as potential problems are detected before any on-disk change is done.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -7585,9 +7592,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.18.0

From b2862294f52b7262d7ec0cb2379688b2176169ea Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 28 Jun 2018 14:38:50 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived
immediately.  This way the other nodes are aware that this new timeline
number is taken and should not be assigned to other nodes.

The window between which the history file is archived and the
end-of-recovery record is written cannot be zeroed, but this way it is
minimized as much as possible. The new order of actions prevents as well
a corrupted data directory on failure.

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=xxmxlmf6ipnfecb...@mail.gmail.com
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3695258e6f..f36d8049bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7523,6 +7523,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if
+		 * it was active, and make a writable copy of the last WAL segment.
+		 * (Note that we also have a copy of the last block of the old WAL in
+		 * readBuf; we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby
+		 * servers. If we crash before actually switching to the new timeline,
+		 * standby servers will nevertheless think that we switched to the new
+		 * timeline, and will try to connect to the new timeline. To minimize
+		 * the window for that, try to do as little as possible between here
+		 * and writing the end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7531,15 +7549,6 @@ StartupXLOG(void)
 	XLogCtl->ThisTimeLineID = ThisTimeLineID;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
-	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
 	 * buffer cache using the block containing the last record from the
-- 
2.18.0

Attachment: signature.asc
Description: PGP signature

Reply via email to