Thanks Shveta for coming back to this point and for fixing the link.

The idea is to check whether the local slot has the same name as the remote one and, if so, to try to resynchronize it with the primary. OK, the check on the failover status of the remote slot is perhaps redundant. I'm not sure what impact setting the synced flag to true might have, but if you run an additional switchover it works fine, because the synced flag on the new primary is now set to true.

If we come back to the idea of the GUC or the API, adding an allow_overwrite parameter to the pg_create_logical_replication_slot function and dropping the existing logical slot when it is set to true could be a suitable approach, along the lines of the sketch below.
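For illustration only, a minimal sketch of how such a call could look; the allow_overwrite argument, its position, and its default are assumptions here, the current function only takes slot_name, plugin, temporary, twophase and failover:

    -- hypothetical usage: allow_overwrite is a proposed parameter, it does not exist today
    SELECT pg_create_logical_replication_slot(
               'logical_slot',   -- slot_name
               'pgoutput',       -- plugin
               false,            -- temporary
               false,            -- twophase
               true,             -- failover
               true);            -- allow_overwrite (proposed): drop a pre-existing
                                 -- slot with the same name before creating this one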
What is your opinion?

Regards,
Fabrice

On Fri, Aug 8, 2025 at 7:39 AM shveta malik <shveta.ma...@gmail.com> wrote:

> On Thu, Aug 7, 2025 at 6:50 PM Fabrice Chapuis <fabrice636...@gmail.com> wrote:
> >
> > Hi,
> >
> > An issue occurred during the initial switchover using PostgreSQL version 17.5.
> > The setup consists of a cluster with two nodes, managed by Patroni version 4.0.5.
> > Logical replication is configured on the same instance, and the new feature
> > enabling logical replication slots to be failover-safe in a highly available
> > environment is used. Logical slot management is currently disabled in Patroni.
> >
> > Following are some screen captures taken during the switchover.
> >
> > 1. Run the switchover with Patroni
> >
> > patronictl switchover
> >
> > Current cluster topology
> >
> > + Cluster: ClusterX (7529893278186104053) ----+----+-----------+
> > | Member   | Host         | Role    | State     | TL | Lag in MB |
> > +----------+--------------+---------+-----------+----+-----------+
> > | node_1   | xxxxxxxxxxxx | Leader  | running   |  4 |           |
> > | node_2   | xxxxxxxxxxxx | Replica | streaming |  4 |         0 |
> > +----------+--------------+---------+-----------+----+-----------+
> >
> > 2. Check the slot on the new Primary
> >
> > select * from pg_replication_slots where slot_type = 'logical';
> > +-[ RECORD 1 ]--------+----------------+
> > | slot_name           | logical_slot   |
> > | plugin              | pgoutput       |
> > | slot_type           | logical        |
> > | datoid              | 25605          |
> > | database            | db_test        |
> > | temporary           | f              |
> > | active              | t              |
> > | active_pid          | 3841546        |
> > | xmin                |                |
> > | catalog_xmin        | 10399          |
> > | restart_lsn         | 0/37002410     |
> > | confirmed_flush_lsn | 0/37002448     |
> > | wal_status          | reserved       |
> > | safe_wal_size       |                |
> > | two_phase           | f              |
> > | inactive_since      |                |
> > | conflicting         | f              |
> > | invalidation_reason |                |
> > | failover            | t              |
> > | synced              | t              |
> > +---------------------+----------------+
> >
> > Logical replication is active again after the promote.
> >
> > 3. Check the slot on the new standby
> >
> > select * from pg_replication_slots where slot_type = 'logical';
> > +-[ RECORD 1 ]--------+-------------------------------+
> > | slot_name           | logical_slot                  |
> > | plugin              | pgoutput                      |
> > | slot_type           | logical                       |
> > | datoid              | 25605                         |
> > | database            | db_test                       |
> > | temporary           | f                             |
> > | active              | f                             |
> > | active_pid          |                               |
> > | xmin                |                               |
> > | catalog_xmin        | 10397                         |
> > | restart_lsn         | 0/3638F5F0                    |
> > | confirmed_flush_lsn | 0/3638F6A0                    |
> > | wal_status          | reserved                      |
> > | safe_wal_size       |                               |
> > | two_phase           | f                             |
> > | inactive_since      | 2025-08-05 10:21:03.342587+02 |
> > | conflicting         | f                             |
> > | invalidation_reason |                               |
> > | failover            | t                             |
> > | synced              | f                             |
> > +---------------------+-------------------------------+
> >
> > The synced flag keeps the value false.
> > The following error appears in the log:
> >
> > 2025-06-10 16:40:58.996 CEST [739829]: [1-1] user=,db=,client=,application= LOG: slot sync worker started
> > 2025-06-10 16:40:59.011 CEST [739829]: [2-1] user=,db=,client=,application= ERROR: exiting from slot synchronization because same name slot "logical_slot" already exists on the standby
> >
> > I would like to make a proposal to address the issue:
> > Since the logical slot is in a failover state on both the primary and the
> > standby, an attempt could be made to resynchronize them.
> > I modified the slotsync.c module:
> >
> > +++ b/src/backend/replication/logical/slotsync.c
> > @@ -649,24 +649,46 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
> >
> >          return false;
> >      }
> > -
> > -    /* Search for the named slot */
> > +    // Both local and remote slot have the same name
> >      if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> >      {
> >          bool    synced;
> > +        bool    failover_status = remote_slot->failover;
> >
> >          SpinLockAcquire(&slot->mutex);
> >          synced = slot->data.synced;
> >          SpinLockRelease(&slot->mutex);
> > +
> > +        if (!synced){
> > +
> > +            Assert(!MyReplicationSlot);
> > +
> > +            if (failover_status){
> > +
> > +                ReplicationSlotAcquire(remote_slot->name, true, true);
> > +
> > +                // Put the synced flag to true to attempt resynchronizing failover slot on the standby
> > +                MyReplicationSlot->data.synced = true;
> > +
> > +                ReplicationSlotMarkDirty();
> >
> > -        /* User-created slot with the same name exists, raise ERROR. */
> > -        if (!synced)
> > -            ereport(ERROR,
> > +                ReplicationSlotRelease();
> > +
> > +                /* Get rid of a replication slot that is no longer wanted */
> > +                ereport(WARNING,
> > +                        errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                        errmsg("slot \"%s\" local slot has the same name as remote slot and they are in failover mode, try to synchronize them",
> > +                               remote_slot->name));
> > +                return false;    /* Going back to the main loop after dropping the failover slot */
> > +            }
> > +            else
> > +                /* User-created slot with the same name exists, raise ERROR. */
> > +                ereport(ERROR,
> >                         errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                         errmsg("exiting from slot synchronization because same"
> > -                              " name slot \"%s\" already exists on the standby",
> > -                              remote_slot->name));
> > -
> > +                              " name slot \"%s\" already exists on the standby",
> > +                              remote_slot->name));
> > +        }
> >      /*
> >       * The slot has been synchronized before.
> >       *
> >
> > This message follows the discussions started in this thread:
> > https://www.postgresql.org/message-id/CAA5-nLDvnqGtBsKu4T_s-cS%2BdGbpSLEzRwgep1XfYzGhQ4o65A%40mail.gmail.com
> >
> > Help would be appreciated to move this point forward
> >
>
> Thank You for starting a new thread and working on it.
>
> I think you have given reference to the wrong thread, the correct one
> is [1].
>
> IIUC, the proposed fix is checking if remote_slot is failover-enabled
> and the slot with the same name exists locally but has 'synced'=false,
> enable the 'synced' flag and proceed with synchronization from next
> cycle onward, else error out. But remote-slot's failover will always
> be true if we have reached this stage. Did you actually mean to check
> if the local slot is failover-enabled but has 'synced' set to false
> (indicating it’s a new standby after a switchover)? Even with that
> check, it might not be the correct thing to overwrite such a slot
> internally. I think in [1], we discussed a couple of ideas related to
> a GUC, alter-api, drop_all_slots API, but I don't see any of those
> proposed here. Do we want to try any of that?
>
> [1]:
> https://www.postgresql.org/message-id/flat/CAA5-nLD0vKn6T1-OHROBNfN2Pxa17zVo4UoVBdfHn2y%3D7nKixA%40mail.gmail.com
>
> thanks
> Shveta
>