Hi hackers,

I am splitting this off of a previous thread aimed at reducing archiving
overhead [0], as I believe this fix might deserve back-patching.

Presently, WAL recycling uses durable_rename_excl(), which notes that a
crash at an unfortunate moment can result in two links to the same file.
My testing [1] demonstrated that it was possible to end up with two links
to the same file in pg_wal after a crash just before unlink() during WAL
recycling.  Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL file was
re-recycled upon restarting.  This seems likely to lead to WAL corruption.

The attached patch prevents this problem by using durable_rename() instead
of durable_rename_excl() for WAL recycling.  This removes the protection
against accidentally overwriting an existing WAL file, but there shouldn't
be one.

This patch also sets the stage for reducing archiving overhead (as
discussed in the other thread [0]).  The proposed change to reduce
archiving overhead will make it more likely that the server will attempt to
re-archive segments after a crash.  This might lead to archive corruption
if the server concurrently writes to the same file via the aforementioned
bug.

[0] https://www.postgresql.org/message-id/20220222011948.GA3850532%40nathanxps13
[1] https://www.postgresql.org/message-id/20220222173711.GA3852671%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 244726f6a78aca52c2fe6e70cef966f152057191 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandboss...@gmail.com>
Date: Thu, 7 Apr 2022 10:07:42 -0700
Subject: [PATCH v1 1/1] Avoid multiple hard links to same WAL file after a
 crash.

Presently, WAL recycling uses durable_rename_excl(), which notes that a crash at
an unfortunate moment can result in two links to the same file.  My testing
demonstrated that it was possible to end up with two links to the same file in
pg_wal after a crash just before unlink() during WAL recycling.  Specifically,
the test produced links to the same file for the current WAL file and the next
one because the half-recycled WAL file was re-recycled upon restarting.  This
seems likely to lead to WAL corruption.

This change prevents this problem by using durable_rename() instead of
durable_rename_excl() for WAL recycling.  This removes the protection against
accidentally overwriting an existing WAL file, but there shouldn't be one.

Back-patch to all supported versions.

Author: Nathan Bossart
---
 src/backend/access/transam/xlog.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6770c3ddba..6ab5b2a622 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3324,13 +3324,15 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	}
 
 	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
+	 * Perform the rename.  Ideally, we'd use link() and unlink() to avoid
+	 * overwriting an existing file (there shouldn't be one).  However, that
+	 * approach opens up the possibility that pg_wal will contain multiple hard
+	 * links to the same WAL file after a crash.
 	 */
-	if (durable_rename_excl(tmppath, path, LOG) != 0)
+	if (durable_rename(tmppath, path, LOG) != 0)
 	{
 		LWLockRelease(ControlFileLock);
-		/* durable_rename_excl already emitted log message */
+		/* durable_rename already emitted log message */
 		return false;
 	}
 
-- 
2.25.1

Reply via email to