Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Michael Paquier Mon, 19 Dec 2016 23:55:20 -0800

On Sat, Dec 17, 2016 at 9:23 PM, Magnus Hagander <mag...@hagander.net> wrote:
> On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier <michael.paqu...@gmail.com>
> wrote:
>> Looking at PrescanPreparedTransactions(), I am thinking as well that it
>> would
>> be better to get a hard failure when bumping on a corrupted 2PC file.
>> Future
>> files are one thing, but corrupted files should be treated more carefully.
>
>
> Again without looking at it, I agree (so much easier that way :P). Ignoring
> corruption is generally a bad idea. Failing hard makes the user notice the
> error, and makes it possible to initiate recovery from a standby or from
> backups or something, or to *intentionally* remove/clear/ignore it.


And I am finishing with the two patches attached:
- 0001 changes the 2PC checks so as corrupted entries are FATAL.
PreScanPreparedTransaction is used when a hot standby is initialized.
In this case a failure protects the range of XIDs generated,
potentially saving from corruption of data. At the end of recovery,
this is done before any on-disk actions are taken.
- 0002 is the thing that Heikki has sent previously to minimize the
window between end-of-recovery record write and timeline history file
archiving.

I am attaching that to next CF.
-- 
Michael

From 0e816454c57b851f4aa2743ea776b1e604a27d77 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 20 Dec 2016 16:47:35 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived immediately.
This way the other nodes are aware that this new timeline number is taken
and should not be assigned to other nodes.

The window between which the history file is archived and the end-of-recovery
record is written cannot be zeroed, but this way it is minimized as much as
possible. The new order of actions prevents as well a corrupted data directory
on failure.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ea2c523e6e..f23feb47e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7210,6 +7210,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if it
+		 * was active, and make a writable copy of the last WAL segment. (Note
+		 * that we also have a copy of the last block of the old WAL in readBuf;
+		 * we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7219,15 +7237,6 @@ StartupXLOG(void)
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
 	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
 	 * Prepare to write WAL starting at EndOfLog position, and init xlog
 	 * buffer cache using the block containing the last record from the
 	 * previous incarnation.
-- 
2.11.0

From 4037db778a4353fc79b15eb683a5b9e3f648db24 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 20 Dec 2016 16:41:44 +0900
Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL

When scanning 2PC files, be it for initializing a hot standby node or
finding the XID range at the end of recovery, bumping on a corrupted
2PC file is reported as a WARNING and are blindly corrupted. If it
happened that the corrupted file was actually legit, there is actually
a risk of corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 23 ++++++++---------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604993..9cbcb83062 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1672,6 +1672,10 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and likely
+ * a new backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -1723,25 +1727,14 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 			/* Read and validate file */
 			buf = ReadTwoPhaseFile(xid, true);
 			if (buf == NULL)
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				continue;
-			}
+				ereport(FATAL, (errmsg("corrupted two-phase file \"%s\"",
+									   clde->d_name)));
 
 			/* Deconstruct header */
 			hdr = (TwoPhaseFileHeader *) buf;
 			if (!TransactionIdEquals(hdr->xid, xid))
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				pfree(buf);
-				continue;
-			}
+				ereport(FATAL, (errmsg("corrupted two-phase file \"%s\"",
+									   clde->d_name)));
 
 			/*
 			 * OK, we think this file is valid.  Incorporate xid into the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa9ee5a0dd..ea2c523e6e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7150,6 +7150,13 @@ StartupXLOG(void)
 	}
 
 	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
+	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
 	 * If we are doing an archive recovery, we always assign a new ID.  This
@@ -7272,9 +7279,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.11.0

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Reply via email to