Hi, I updated the patch with the following changes:
- Remove the assertion from smgrtruncate() - it would need to assert that it's called in a critical section. Not sure why it's not already asserting that? The function header says: * ... This function should normally * be called in a critical section, but the current size must be checked * outside the critical section, and no interrupts or smgr functions relating * to this relation should be called in between. The "should normally" is bit weird imo, when would it be safe to *not* use it in a critical section? - added comments about the reason for HOLD_INTERRUPTS to smgrdounlinkall(), smgrdestroyall() and smgrreleaseall() - moved the HOLD_INTERRUPTS after FlushRelationsAllBuffers() in smgrdosyncall - updated the comment & commit message talking about the problem being due to debug elogs/ereports - it's also LOG/WARNING. I was writing a remark in the commit message, explaining that the only < ERROR elog/ereport that can be reached is very unlikely to be reachable, and therefore it's not worth the risk of backpatching. But it turns out there also are WARNINGs in mdunlinkfork(), which seem a lot easier to reach than the DEBUG1 in register_dirty_segment(). I still am leaning against backpatching, but I'm not sure that's not just laziness. On 2025-03-19 17:45:14 -0700, Noah Misch wrote: > On Wed, Mar 19, 2025 at 06:45:20PM -0400, Andres Freund wrote: > > On 2025-03-19 12:55:53 -0700, Noah Misch wrote: > > > On Mon, Mar 17, 2025 at 07:52:02PM -0400, Andres Freund wrote: > > > > @@ -434,6 +481,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels) > > > > if (nrels == 0) > > > > return; > > > > > > > > + HOLD_INTERRUPTS(); > > > > + > > > > FlushRelationsAllBuffers(rels, nrels); > > > > > > FlushRelationsAllBuffers() isn't part of smgr or md.c, so it's unlikely to > > > become sensitive to smgrrelease(). It may do a ton of I/O. Hence, I'd > > > HOLD_INTERRUPTS() after FlushRelationsAllBuffers(), not before. > > > > Hm - we never would want to process interrupts while in > > FlushRelationsAllBuffers() or such, would we? I'm ok with changing it, I > > guess I just didn't see a reason not to use a wider scope. > > If we get a query cancel or fast shutdown, it's better for the user to abort > the transaction rather than keep flushing. smgrDoPendingSyncs() calls here > rather late in the pre-commit actions, so failing is still supposed to be > fine. I think the code succeeds at making it fine to fail here. But we don't actually intentionally accept interrupts in FlushRelationsAllBuffers()? It would only happen as a side-effect of a non-error elog/ereport() processing interrupts, right? It also looks like we couldn't accept interrupts when called by AbortTransaction(), because there we already are in a HOLD_INTERRUPTS() region. I'm pretty sure an error would trigger at least an assertion. But that's really an independent issue. Moved. > > > > @@ -471,6 +522,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool > > > > isRedo) > > > > if (nrels == 0) > > > > return; > > > > > > > > + HOLD_INTERRUPTS(); > > > > > > I would move this below DropRelationsAllBuffers(), for reasons like > > > FlushRelationsAllBuffers() above. > > > > I think that'd be unsafe. Once we dropped buffers from the buffer pool we > > can't just continue without also unlinking the underlying relation, > > otherwise > > an older view of the data later can be "revived from the dead" after an > > error, > > causing all manner of corruption. > > > > I suspect it's always called with interrupts held already though. > > Ah, confirmed. If I put this assert at the top of smgrdounlinkall(), > check-world passes: > > Assert(IsBinaryUpgrade || InRecovery || !INTERRUPTS_CAN_BE_PROCESSED()); I just made it hold interrupts for now, hope that makes sense? > > > I struggle to speculate about the merits of this, because mdopen() can't > > > fail. > > > If mdopen() started to do things that could fail, mdnblocks() would be > > > reasonable in assuming those things are done. Hence, the long-term > > > direction > > > should be more like destroying the new smgr entry in the event of error. > > > > > > I would not make this change. I'd maybe add a comment that smgr_open > > > callbacks currently aren't allowed to fail, since smgropen() isn't ready > > > to > > > clean up smgr-level state from a failed open. How do you see it? > > > > I see no disadvantage in the change - it seems strictly better to initialize > > pincount earlier. I agree that it might be a good idea to explicitly handle > > errors eventually, but that'd not be made harder by this change... > > Okay. I suppose if mdopen() gained the ability to fail and mdnblocks() also > gained the ability to cure said failure, this change would make that okay. FWIW, if mdopen() could fail, it should probably only do fallible operations after doing the non-fallible initialization. Then mdnblocks() wouldn't need to cure anything. Greetings, Andres Freund
>From 316777b4b749d404e3ae40cdb3f844847987ffe8 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Tue, 18 Mar 2025 14:40:05 -0400 Subject: [PATCH v3] smgr: Hold interrupts in most smgr functions We need to hold interrupts across most of the smgr.c/md.c functions, as otherwise interrupt processing, e.g. due to a < ERROR elog/ereport, can trigger procsignal processing, which in turn can trigger smgrreleaseall(). As the relevant code is not reentrant, we quickly end up in a bad situation. The only reason we haven't noticed this before is that there is only one non-error ereport called in affected routines, in register_dirty_segments(), and that one is extremely rarely reached. If one enables fd.c's FDDEBUG it's easy to reproduce crashes. It seems better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() in smgr.c, instead of trying to push them down to md.c where possible: For one, every smgr implementation would be vulnerable, for another, a good bit of smgr.c code itself is affected too. Eventually we might want a more targeted solution, allowing e.g. a networked smgr implementation to be interrupted, but many other, more complicated, problems would need to be fixed for that to be viable (e.g. smgr.c is often called with interrupts already held). One could argue this should be backpatched, but the existing < ERROR elog/ereports that can be reached with unmodified sources are unlikely to be reached. On balance the risk of backpatching seems higher than the gain - at least for now. Reviewed-by: Noah Misch <n...@leadboat.com> Reviewed-by: Thomas Munro <thomas.mu...@gmail.com> Discussion: https://postgr.es/m/3vae7l5ozvqtxmd7rr7zaeq3qkuipz365u3rtim5t5wdkr6f4g@vkgf2fogjirl --- src/backend/storage/smgr/smgr.c | 104 +++++++++++++++++++++++++++++++- 1 file changed, 101 insertions(+), 3 deletions(-) diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index e6cbb9b4ca2..af74f54b43b 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -40,6 +40,18 @@ * themselves, as there could pointers to them in active use. See * smgrrelease() and smgrreleaseall(). * + * NB: We need to hold interrupts across most of the functions in this file, + * as otherwise interrupt processing, e.g. due to a < ERROR elog/ereport, can + * trigger procsignal processing, which in turn can trigger + * smgrreleaseall(). Most of the relevant code is not reentrant. It seems + * better to put the HOLD_INTERRUPTS()/RESUME_INTERRUPTS() here, instead of + * trying to push them down to md.c where possible: For one, every smgr + * implementation would be vulnerable, for another, a good bit of smgr.c code + * itself is affected too. Eventually we might want a more targeted solution, + * allowing e.g. a networked smgr implementation to be interrupted, but many + * other, more complicated, problems would need to be fixed for that to be + * viable (e.g. smgr.c is often called with interrupts already held). + * * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * @@ -53,6 +65,7 @@ #include "access/xlogutils.h" #include "lib/ilist.h" +#include "miscadmin.h" #include "storage/bufmgr.h" #include "storage/ipc.h" #include "storage/md.h" @@ -158,12 +171,16 @@ smgrinit(void) { int i; + HOLD_INTERRUPTS(); + for (i = 0; i < NSmgr; i++) { if (smgrsw[i].smgr_init) smgrsw[i].smgr_init(); } + RESUME_INTERRUPTS(); + /* register the shutdown proc */ on_proc_exit(smgrshutdown, 0); } @@ -176,11 +193,15 @@ smgrshutdown(int code, Datum arg) { int i; + HOLD_INTERRUPTS(); + for (i = 0; i < NSmgr; i++) { if (smgrsw[i].smgr_shutdown) smgrsw[i].smgr_shutdown(); } + + RESUME_INTERRUPTS(); } /* @@ -206,6 +227,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend) Assert(RelFileNumberIsValid(rlocator.relNumber)); + HOLD_INTERRUPTS(); + if (SMgrRelationHash == NULL) { /* First time through: initialize the hash table */ @@ -242,6 +265,8 @@ smgropen(RelFileLocator rlocator, ProcNumber backend) smgrsw[reln->smgr_which].smgr_open(reln); } + RESUME_INTERRUPTS(); + return reln; } @@ -283,6 +308,8 @@ smgrdestroy(SMgrRelation reln) Assert(reln->pincount == 0); + HOLD_INTERRUPTS(); + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) smgrsw[reln->smgr_which].smgr_close(reln, forknum); @@ -292,6 +319,8 @@ smgrdestroy(SMgrRelation reln) &(reln->smgr_rlocator), HASH_REMOVE, NULL) == NULL) elog(ERROR, "SMgrRelation hashtable corrupted"); + + RESUME_INTERRUPTS(); } /* @@ -302,12 +331,16 @@ smgrdestroy(SMgrRelation reln) void smgrrelease(SMgrRelation reln) { + HOLD_INTERRUPTS(); + for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++) { smgrsw[reln->smgr_which].smgr_close(reln, forknum); reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber; } reln->smgr_targblock = InvalidBlockNumber; + + RESUME_INTERRUPTS(); } /* @@ -336,6 +369,9 @@ smgrdestroyall(void) { dlist_mutable_iter iter; + /* seems unsafe to accept interrupts while in a dlist_foreach_modify() */ + HOLD_INTERRUPTS(); + /* * Zap all unpinned SMgrRelations. We rely on smgrdestroy() to remove * each one from the list. @@ -347,6 +383,8 @@ smgrdestroyall(void) smgrdestroy(rel); } + + RESUME_INTERRUPTS(); } /* @@ -362,12 +400,17 @@ smgrreleaseall(void) if (SMgrRelationHash == NULL) return; + /* seems unsafe to accept interrupts while iterating */ + HOLD_INTERRUPTS(); + hash_seq_init(&status, SMgrRelationHash); while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL) { smgrrelease(reln); } + + RESUME_INTERRUPTS(); } /* @@ -400,7 +443,13 @@ smgrreleaserellocator(RelFileLocatorBackend rlocator) bool smgrexists(SMgrRelation reln, ForkNumber forknum) { - return smgrsw[reln->smgr_which].smgr_exists(reln, forknum); + bool ret; + + HOLD_INTERRUPTS(); + ret = smgrsw[reln->smgr_which].smgr_exists(reln, forknum); + RESUME_INTERRUPTS(); + + return ret; } /* @@ -413,7 +462,9 @@ smgrexists(SMgrRelation reln, ForkNumber forknum) void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) { + HOLD_INTERRUPTS(); smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); + RESUME_INTERRUPTS(); } /* @@ -436,6 +487,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels) FlushRelationsAllBuffers(rels, nrels); + HOLD_INTERRUPTS(); + /* * Sync the physical file(s). */ @@ -449,6 +502,8 @@ smgrdosyncall(SMgrRelation *rels, int nrels) smgrsw[which].smgr_immedsync(rels[i], forknum); } } + + RESUME_INTERRUPTS(); } /* @@ -471,6 +526,13 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo) if (nrels == 0) return; + /* + * It would be unsafe to process interrupts between DropRelationBuffers() + * and unlinking the underlying files. This probably should be a critical + * section, but we're not there yet. + */ + HOLD_INTERRUPTS(); + /* * Get rid of any remaining buffers for the relations. bufmgr will just * drop them without bothering to write the contents. @@ -522,6 +584,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo) } pfree(rlocators); + + RESUME_INTERRUPTS(); } @@ -538,6 +602,8 @@ void smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync) { + HOLD_INTERRUPTS(); + smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum, buffer, skipFsync); @@ -550,6 +616,8 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, reln->smgr_cached_nblocks[forknum] = blocknum + 1; else reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber; + + RESUME_INTERRUPTS(); } /* @@ -563,6 +631,8 @@ void smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nblocks, bool skipFsync) { + HOLD_INTERRUPTS(); + smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum, nblocks, skipFsync); @@ -575,6 +645,8 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, reln->smgr_cached_nblocks[forknum] = blocknum + nblocks; else reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber; + + RESUME_INTERRUPTS(); } /* @@ -588,7 +660,13 @@ bool smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nblocks) { - return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks); + bool ret; + + HOLD_INTERRUPTS(); + ret = smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks); + RESUME_INTERRUPTS(); + + return ret; } /* @@ -601,7 +679,13 @@ uint32 smgrmaxcombine(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) { - return smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum); + uint32 ret; + + HOLD_INTERRUPTS(); + ret = smgrsw[reln->smgr_which].smgr_maxcombine(reln, forknum, blocknum); + RESUME_INTERRUPTS(); + + return ret; } /* @@ -619,8 +703,10 @@ void smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, void **buffers, BlockNumber nblocks) { + HOLD_INTERRUPTS(); smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers, nblocks); + RESUME_INTERRUPTS(); } /* @@ -653,8 +739,10 @@ void smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void **buffers, BlockNumber nblocks, bool skipFsync) { + HOLD_INTERRUPTS(); smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum, buffers, nblocks, skipFsync); + RESUME_INTERRUPTS(); } /* @@ -665,8 +753,10 @@ void smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, BlockNumber nblocks) { + HOLD_INTERRUPTS(); smgrsw[reln->smgr_which].smgr_writeback(reln, forknum, blocknum, nblocks); + RESUME_INTERRUPTS(); } /* @@ -683,10 +773,14 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum) if (result != InvalidBlockNumber) return result; + HOLD_INTERRUPTS(); + result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum); reln->smgr_cached_nblocks[forknum] = result; + RESUME_INTERRUPTS(); + return result; } @@ -784,7 +878,9 @@ smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks, void smgrregistersync(SMgrRelation reln, ForkNumber forknum) { + HOLD_INTERRUPTS(); smgrsw[reln->smgr_which].smgr_registersync(reln, forknum); + RESUME_INTERRUPTS(); } /* @@ -816,7 +912,9 @@ smgrregistersync(SMgrRelation reln, ForkNumber forknum) void smgrimmedsync(SMgrRelation reln, ForkNumber forknum) { + HOLD_INTERRUPTS(); smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); + RESUME_INTERRUPTS(); } /* -- 2.48.1.76.g4e746b1a31.dirty