On Mon, Mar 21, 2022 at 11:53 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > I tried to debug the case but I realized that somehow
> > CHECK_FOR_INTERRUPTS() is not calling the
> > AcceptInvalidationMessages() and I could not find the same while
> > looking into the code as well. While debugging I noticed that
> > AcceptInvalidationMessages() is called multiple times but that is only
> > through LockRelationId() but while locking the relation we had already
> > closed the previous smgr because at a time we keep only one smgr open.
> > And that's the reason it is not hitting the issue which we think it
> > could. Is there any condition under which it will call
> > AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because
> > I could not see while debugging as well as in code.
>
> Yeah, I think the reason you can't find it is that it's not there. I
> was confused in what I wrote earlier. I think we only process sinval
> catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I
> think the reason for that is precisely that it would be hard to write
> correct code otherwise, since invalidations might then get processed
> in a lot more places. So ... I guess all we really need to do here is
> avoid assuming that the results of smgropen() are valid across any
> code that might acquire relation locks. Which I think the code already
> does.
>
> But on a related note, why doesn't CreateDatabaseUsingWalLog() acquire
> locks on both the source and destination relations? It looks like
> you're only taking locks for the source template database ... but I
> thought the intention here was to make sure that we didn't pull pages
> into shared_buffers without holding a lock on the relation and/or the
> database? I suppose the point is that while the template database
> might be concurrently dropped, nobody can be doing anything
> concurrently to the target database because nobody knows that it
> exists yet. Still, I think that this would be the only case where we
> let pages into shared_buffers without a relation or database lock,
> though maybe I'm confused about this point, too. If not, perhaps we
> should consider locking the target database OID and each relation OID
> as we are copying it?
>
> I guess I'm imagining that there might be more code pathways in the
> future that want to ensure that there are no remaining buffers for
> some particular database or relation OID. It seems natural to want to
> be able to take some lock that prevents buffers from being added, and
> then go and get rid of all the ones that are there already. But I
> admit I can't quite think of a concrete case where we'd want to do
> something like this where the patch as coded would be a problem. I'm
> just thinking perhaps taking locks is fairly harmless and might avoid
> some hypothetical problem later.
>
> Thoughts?
I think this makes sense. I haven't changed the original patch, since you said
you were improving some of the comments; to avoid conflicts I have created this
as an add-on patch on top of it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
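The per-relation locking in the patch below relies on lmgr's ability to lock a
relation purely by its (database OID, relation OID) pair, without opening it.
As a rough, illustrative sketch of that pattern (the helper name and arguments
are placeholders; LockRelId and LockRelationId() are the real symbols from
utils/rel.h and storage/lmgr.h), the per-relation step boils down to:

#include "postgres.h"

#include "storage/lmgr.h"		/* LockRelationId(), AccessShareLock */

/*
 * Sketch only: take AccessShareLock on the same relation OID in both the
 * source and the destination database before copying its blocks, so that a
 * concurrent drop (which needs AccessExclusiveLock) would block rather than
 * remove the relation mid-copy.
 */
static void
lock_relation_in_both_dbs(Oid src_dboid, Oid dst_dboid, Oid reloid)
{
	LockRelId	relid;

	/* Lock the source (template) copy of the relation ... */
	relid.dbId = src_dboid;
	relid.relId = reloid;
	LockRelationId(&relid, AccessShareLock);

	/* ... and the destination copy, before touching its shared buffers. */
	relid.dbId = dst_dboid;
	relid.relId = reloid;
	LockRelationId(&relid, AccessShareLock);
}

AccessShareLock should be enough here because the copy only needs to keep the
relation from being dropped or rewritten underneath it.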
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9636688..5d0750f 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -460,12 +460,6 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_ts
 	Assert(rnodelist != NIL);
 
 	/*
-	 * Database id is common for all the relation so set it before entering to
-	 * the loop.
-	 */
-	relid.dbId = src_dboid;
-
-	/*
 	 * Iterate over each relfilenode and copy the relation data block by block
 	 * from source database to the destination database.
 	 */
@@ -488,7 +482,15 @@ CreateDatabaseUsingWalLog(Oid src_dboid, Oid dst_dboid, Oid src_tsid, Oid dst_ts
 		dstrnode.dbNode = dst_dboid;
 		dstrnode.relNode = srcrnode.relNode;
 
-		/* Acquire the lock on relation before start copying. */
+		/*
+		 * Acquire locks on both the source and the destination relations
+		 * before starting the copy.
+		 */
+		relid.dbId = src_dboid;
+		relid.relId = relinfo->reloid;
+		LockRelationId(&relid, AccessShareLock);
+
+		relid.dbId = dst_dboid;
 		relid.relId = relinfo->reloid;
 		LockRelationId(&relid, AccessShareLock);
 
@@ -1218,6 +1220,17 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 	InvokeObjectPostCreateHook(DatabaseRelationId, dboid, 0);
 
 	/*
+	 * Acquire a lock on the target database.  Although this is a new
+	 * database and no one else should be able to see it yet, with the
+	 * WAL-log strategy we are going to access its relation pages through
+	 * shared buffers, so it is better to hold the database lock.  The
+	 * individual relation locks are then acquired as and when each
+	 * relation's shared buffers are accessed.
+	 */
+	if (dbstrategy == CREATEDB_WAL_LOG)
+		LockSharedObject(DatabaseRelationId, dboid, 0, ExclusiveLock);
+
+	/*
 	 * Once we start copying subdirectories, we need to be able to clean 'em
 	 * up if we fail. Use an ENSURE block to make sure this happens. (This
 	 * is not a 100% solution, because of the possibility of failure during
@@ -1342,6 +1355,10 @@ createdb_failure_callback(int code, Datum arg)
 	{
 		DropDatabaseBuffers(fparms->dest_dboid);
 		ForgetDatabaseSyncRequests(fparms->dest_dboid);
+
+		/* Release lock on the target database. */
+		UnlockSharedObject(DatabaseRelationId, fparms->dest_dboid, 0,
+						   ExclusiveLock);
 	}
 
 	/*
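On the buffer-eviction point raised in the quoted mail (take a lock that keeps
new buffers for an object from appearing, then drop the ones already cached),
the combination this patch enables would look roughly like the sketch below.
This is illustrative only and not part of the patch; the helper name is made
up, while LockSharedObject(), UnlockSharedObject(), DropDatabaseBuffers() and
DatabaseRelationId are the existing backend symbols.

#include "postgres.h"

#include "catalog/pg_database.h"	/* DatabaseRelationId */
#include "storage/bufmgr.h"		/* DropDatabaseBuffers() */
#include "storage/lmgr.h"		/* LockSharedObject(), ExclusiveLock */

/*
 * Sketch only: while a conflicting lock on the database object is held,
 * any code path that (like the patched createdb) locks the database
 * before pulling its pages into shared buffers is kept out, so dropping
 * the cached buffers leaves none behind.  This mirrors what the failure
 * callback does for the half-created database.
 */
static void
evict_database_buffers(Oid dboid)
{
	LockSharedObject(DatabaseRelationId, dboid, 0, ExclusiveLock);
	DropDatabaseBuffers(dboid);
	UnlockSharedObject(DatabaseRelationId, dboid, 0, ExclusiveLock);
}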