Hi, I'm starting a new thread for this patch that originated as a side-discussion in [1], to give it its own CF entry in the next cycle. This is a WIP with an open question to research: what could actually break if we did this?
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com
From 61a15ed286a1fd824b4e2b4b689cbe6688930e6e Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Tue, 2 Mar 2021 16:09:51 +1300 Subject: [PATCH] Make relfile tombstone files conditional on WAL level. Traditionally we have left behind an empty file to be unlinked after the next checkpoint, when dropping a relation. That would prevent GetNewRelFileNode() from recycling the relfile, avoiding a schedule that could corrupt data in crash recovery, with wal_level=minimal. Since the default wal_level changed to replica in release 10, and since this mechanism introduces costs elsewhere, let's only do it if we have to. In particular, this avoids the need for DROP TABLESPACE for force an expensive checkpoint just to clear out tombstone files. XXX What would break if we did this? Discussion: https://postgr.es/m/CA%2BhUKGLT3zibuLkn_j9xiPWn6hxH9Br-TsJoSaFgQOpxpEUnPQ%40mail.gmail.com --- src/backend/commands/tablespace.c | 14 +++++++++++--- src/backend/storage/smgr/md.c | 10 ++++++---- src/include/access/xlog.h | 6 ++++++ 3 files changed, 23 insertions(+), 7 deletions(-) diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c index 69ea155d50..c1d12f3d19 100644 --- a/src/backend/commands/tablespace.c +++ b/src/backend/commands/tablespace.c @@ -512,8 +512,10 @@ DropTableSpace(DropTableSpaceStmt *stmt) */ if (!destroy_tablespace_directories(tablespaceoid, false)) { + bool try_again = false; + /* - * Not all files deleted? However, there can be lingering empty files + * Not all files deleted? However, there can be lingering tombstones * in the directories, left behind by for example DROP TABLE, that * have been scheduled for deletion at next checkpoint (see comments * in mdunlink() for details). We could just delete them immediately, @@ -528,8 +530,14 @@ DropTableSpace(DropTableSpaceStmt *stmt) * TABLESPACE should not give up on the tablespace becoming empty * until all relevant invalidation processing is complete. */ - RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); - if (!destroy_tablespace_directories(tablespaceoid, false)) + + if (XLogNeedRelFileTombstones()) + { + RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); + try_again = true; + } + + if (!try_again || !destroy_tablespace_directories(tablespaceoid, false)) { /* Still not empty, the files must be important then */ ereport(ERROR, diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 1e12cfad8e..447122519e 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -236,8 +236,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo) * forkNum can be a fork number to delete a specific fork, or InvalidForkNumber * to delete all forks. * - * For regular relations, we don't unlink the first segment file of the rel, - * but just truncate it to zero length, and record a request to unlink it after + * For regular relations, we don't always unlink the first segment file, + * depending on the WAL level. If XLogNeedRelFileTombstones() is true, we + * just truncate it to zero length, and record a request to unlink it after * the next checkpoint. Additional segments can be unlinked immediately, * however. Leaving the empty file in place prevents that relfilenode * number from being reused. The scenario this protects us from is: @@ -321,7 +322,8 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) /* * Delete or truncate the first segment. */ - if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode)) + if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode) || + !XLogNeedRelFileTombstones()) { if (!RelFileNodeBackendIsTemp(rnode)) { @@ -349,7 +351,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) /* Prevent other backends' fds from holding on to the disk space */ ret = do_truncate(path); - /* Register request to unlink first segment later */ + /* Leave the file as a tombstone, to be unlinked at checkpoint time. */ register_unlink_segment(rnode, forkNum, 0 /* first seg */ ); } diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 75ec1073bd..cca04a6aa8 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -207,6 +207,12 @@ extern PGDLLIMPORT int wal_level; /* Do we need to WAL-log information required only for logical replication? */ #define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL) +/* + * Is the WAL-level so low that it is unsafe to recycle relfilenodes between + * checkpoints? See mdunlinkfork(). + */ +#define XLogNeedRelFileTombstones() (wal_level < WAL_LEVEL_REPLICA) + #ifdef WAL_DEBUG extern bool XLOG_DEBUG; #endif -- 2.30.0