Hi,

I'm starting a new thread for this patch that originated as a
side-discussion in [1], to give it its own CF entry in the next cycle.
This is a WIP with an open question to research: what could actually
break if we did this?

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com
From 61a15ed286a1fd824b4e2b4b689cbe6688930e6e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 2 Mar 2021 16:09:51 +1300
Subject: [PATCH] Make relfile tombstone files conditional on WAL level.

Traditionally we have left behind an empty file to be unlinked after the
next checkpoint, when dropping a relation.  That would prevent
GetNewRelFileNode() from recycling the relfile, avoiding a schedule
that could corrupt data in crash recovery, with wal_level=minimal.

Since the default wal_level changed to replica in release 10, and since
this mechanism introduces costs elsewhere, let's only do it if we have
to.  In particular, this avoids the need for DROP TABLESPACE for force
an expensive checkpoint just to clear out tombstone files.

XXX What would break if we did this?

Discussion: https://postgr.es/m/CA%2BhUKGLT3zibuLkn_j9xiPWn6hxH9Br-TsJoSaFgQOpxpEUnPQ%40mail.gmail.com
---
 src/backend/commands/tablespace.c | 14 +++++++++++---
 src/backend/storage/smgr/md.c     | 10 ++++++----
 src/include/access/xlog.h         |  6 ++++++
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 69ea155d50..c1d12f3d19 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -512,8 +512,10 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	 */
 	if (!destroy_tablespace_directories(tablespaceoid, false))
 	{
+		bool		try_again = false;
+
 		/*
-		 * Not all files deleted?  However, there can be lingering empty files
+		 * Not all files deleted?  However, there can be lingering tombstones
 		 * in the directories, left behind by for example DROP TABLE, that
 		 * have been scheduled for deletion at next checkpoint (see comments
 		 * in mdunlink() for details).  We could just delete them immediately,
@@ -528,8 +530,14 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 		 * TABLESPACE should not give up on the tablespace becoming empty
 		 * until all relevant invalidation processing is complete.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-		if (!destroy_tablespace_directories(tablespaceoid, false))
+
+		if (XLogNeedRelFileTombstones())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+			try_again = true;
+		}
+
+		if (!try_again || !destroy_tablespace_directories(tablespaceoid, false))
 		{
 			/* Still not empty, the files must be important then */
 			ereport(ERROR,
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..447122519e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -236,8 +236,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * forkNum can be a fork number to delete a specific fork, or InvalidForkNumber
  * to delete all forks.
  *
- * For regular relations, we don't unlink the first segment file of the rel,
- * but just truncate it to zero length, and record a request to unlink it after
+ * For regular relations, we don't always unlink the first segment file,
+ * depending on the WAL level.  If XLogNeedRelFileTombstones() is true, we
+ * just truncate it to zero length, and record a request to unlink it after
  * the next checkpoint.  Additional segments can be unlinked immediately,
  * however.  Leaving the empty file in place prevents that relfilenode
  * number from being reused.  The scenario this protects us from is:
@@ -321,7 +322,8 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
 	/*
 	 * Delete or truncate the first segment.
 	 */
-	if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
+	if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode) ||
+		!XLogNeedRelFileTombstones())
 	{
 		if (!RelFileNodeBackendIsTemp(rnode))
 		{
@@ -349,7 +351,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
 		/* Prevent other backends' fds from holding on to the disk space */
 		ret = do_truncate(path);
 
-		/* Register request to unlink first segment later */
+		/* Leave the file as a tombstone, to be unlinked at checkpoint time. */
 		register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
 	}
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..cca04a6aa8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -207,6 +207,12 @@ extern PGDLLIMPORT int wal_level;
 /* Do we need to WAL-log information required only for logical replication? */
 #define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)
 
+/*
+ * Is the WAL-level so low that it is unsafe to recycle relfilenodes between
+ * checkpoints?  See mdunlinkfork().
+ */
+#define XLogNeedRelFileTombstones() (wal_level < WAL_LEVEL_REPLICA)
+
 #ifdef WAL_DEBUG
 extern bool XLOG_DEBUG;
 #endif
-- 
2.30.0

Reply via email to