On Thu, Jun 18, 2026 at 11:20 AM Amit Kapila <[email protected]> wrote: > > On Wed, Jun 17, 2026 at 12:59 PM Xuneng Zhou <[email protected]> wrote: > > > > On Tue, Jun 16, 2026 at 8:46 PM Fujii Masao <[email protected]> wrote: > > > > > > On Fri, Jun 12, 2026 at 7:54 PM Amit Kapila <[email protected]> > > > wrote: > > > > I feel even if there is an argument to do such a refactoring, it can > > > > be done separately. We can push forward with 0001 and then do more > > > > discussion for 0002, if required. I can take care of 0001 unless > > > > Fujii-San wishes to take care of it? > > > > > > Yeah, please feel free to work on 0001. > > > > > > Regarding 0002, since the race is very rare and non-fatal, I'm okay > > > with accepting the risk rather than adding more refactoring just to > > > avoid it. > > > > > > I'm a bit tempted to add a source comment explaining the risk and > > > why we accept it, though, so other developers can understand > > > the tradeoff. For example: > > > > > > diff --git a/src/backend/replication/logical/slotsync.c > > > b/src/backend/replication/logical/slotsync.c > > > index 05637344363..ca49f20e7d9 100644 > > > --- a/src/backend/replication/logical/slotsync.c > > > +++ b/src/backend/replication/logical/slotsync.c > > > @@ -560,6 +560,12 @@ drop_local_obsolete_slots(List *remote_slot_list) > > > * the same shared memory as that of > > > 'local_slot'. Thus check if > > > * local_slot is still the synced one before > > > performing the actual > > > * drop. > > > + * > > > + * Because local_slot still points to a > > > reusable slot-array entry, > > > + * fields such as name or database OID could > > > already be stale here. > > > + * That could cause an incorrect cleanup > > > decision for this cycle or > > > + * briefly lock an unrelated database. We > > > accept that risk because > > > + * this race is rare and non-fatal. > > > */ > > > SpinLockAcquire(&local_slot->mutex); > > > synced_slot = local_slot->in_use && > > > local_slot->data.synced; > > > > Thanks for suggesting the comment! It helps to clarify the situation > > and the trade-off we made here. I tweaked it a bit and added it to the > > patches prepared by Zhijie. > > > > + * > + * We cannot close this window by holding > + * ReplicationSlotControlLock while taking the database lock, > + * because the database-drop path holds the database lock and then > + * scans replication slots. > > The database-drop path acquires ReplicationSlotControlLock in shared > mode, so not sure if the above is completely correct, here you are > going in the direction of trying to defend that no easy solution > exists which needs more thought.
I agree that this phrasing could be misleading. The dangerous part is in the later actual slot drop. Eventually the database-drop path needs ReplicationSlotControlLock in exclusive mode to set slot->in_use = false. If the slotsync holds ReplicationSlotControlLock in shared mode while it tries to take DB AccessShareLock. Then this would lead to a deadlock. On second thought, I’m inclined not to add a comment explaining the constraint here, since it would be difficult to do so concisely and it's not directly related to the issue. > Fujii-San's proposal was better but > there also we may need to be a bit more specific about "That could > cause an incorrect cleanup decision ...", otherwise, it makes the > comment unclear. OK, how about elaborate it a bit like this: /* * In the small window between getting the slot to drop and * locking the database, there is a possibility of a parallel * database drop by the startup process and the creation of a new * slot by the user. This new user-created slot may end up using * the same shared memory as that of 'local_slot'. * * If that happens, local_slot now describes the replacement slot: * local_sync_slot_required() may have made its drop decision using * the replacement slot's name or invalidation state, and slot_database * may refer to the replacement slot's database. Thus check if * local_slot is still a synced slot before performing the actual drop. * This does not prove it is the original slot, but it prevents dropping * an ordinary user-created replacement slot, and the copied database OID * keeps lock/unlock symmetric. The remaining risk is limited to this * cleanup cycle, such as briefly holding an unrelated database lock, and * is acceptable here because this race is rare. */ > I am planning to commit and backpatch the fix for the first problem > based on what Hou-San has shared (v2-*), then we can discuss how to > improve the existing comments and if we agree on something, that can > be a HEAD-only patch. Thanks! -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
