Hi Fujii-san, On Wed, Jun 3, 2026 at 8:03 PM Fujii Masao <[email protected]> wrote: > > On Tue, Jun 2, 2026 at 3:00 PM Xuneng Zhou <[email protected]> wrote: > > > > Hi, > > > > On Sat, May 30, 2026 at 4:14 PM Zhijie Hou (Fujitsu) > > <[email protected]> wrote: > > > I'm not sure if adding an injection point for this rare case is > > > worthwhile. Even > > > if we were to add one, future refactoring of that function could shift the > > > position of the injection point, so its long-term usefulness is > > > uncertain. I > > > don't have a strong opinion on this, so I'll leave it to Fujii-San to > > > decide. > > I've pushed the patch. Thanks! > > IMO the proposed test looks a bit too narrow, so I'm not sure it's worth > adding at this point. For now, I've committed only the code fix. > > > > There's an adjacent bug around drop_local_obsolete_slots. The root > > cause of them looks similar -- ReplicationSlot * is a pointer to a > > reusable shared-memory array cell, not a durable identity for the same > > slot. In drop_local_obsolete_slots, the issue is that the slot has > > been freed after ReplicationSlotDropAcquired(false); however, another > > backend may reuse the same cell before the unlock/log reads. This > > seems less severe -- it does not normally corrupt slot state, because > > the code only read after the drop. But it can still unlock/log > > misusing the identity of a different slot. Attached a test using > > injection point to reproduce it and a patch to fix it. > > Thanks again for the report and patch! > > /* Drop the local slot if it is not required to be retained. */ > if (!local_sync_slot_required(local_slot, remote_slot_list)) > { > + bool dropped = false; > + NameData slot_name = {0}; > + Oid slot_database = local_slot->data.database; > bool synced_slot; > > Is it really safe to read slot_database before acquiring the database lock?
Reading slot_database before taking the database lock seems not inherently unsafe by itself. The comment suggests that the lock is primarily used to prevent conflicts with the startup process running ReplicationSlotsDropDBSlots() during db-drop replay; it does not protect replication slot array reuse. The unsafe part could be reading slot_database from local_slot after ReplicationSlotControlLock has been released. At this point, the slot array cell may already have been freed and reused, so the value read may no longer belong to the slot that get_local_synced_slots() originally collected. As a result, we could end up locking the wrong database. There seems to be two related issues: 1) Before drop: reading local_slot->data.database / local_slot->data.name after the slot-array lock was released, before verifying the cell still represents the same synced slot. 2) After drop: calling ReplicationSlotDropAcquired(false) and then reading local_slot->data.database / local_slot->data.name for unlocking/logging after the cell may have been reused by another backend. The prior patch targets to fix 2), but leaves 1) unfixed. > BTW, I'm also wondering whether it's safe for > local_sync_slot_required() to access local_slot without holding a lock. I share the same concern here. The issue smells similar to the one discussed above -- reading values from the array cell directly without the protection of array lock. To solve them altogether, it might be better to stop carrying raw ReplicationSlot * values across unprotected windows. We can get the snapshot values such as slot name & database oid from get_local_synced_slots() instead. Then local_sync_slot_required() works from the snapshot, and drop_local_obsolete_slots() uses the copied database OID to take the database lock. Before dropping, it should revalidate by copied name/database that the slot is still a synced logical slot, then acquire/drop by copied name, and use only copied values for unlock/logging. I plan to prepare a refactoring patch for this. Does that seem like the right direction to you? -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
