Thanks Shveta for coming back to this point and for fixing the link.

The idea is to check whether the local slot has the same name as the remote one and, if so, to try to resynchronize it with the primary. OK, the check on the failover status of the remote slot is perhaps redundant. I'm not sure what impact setting the synced flag to true might have, but if you run an additional switchover it works fine, because the synced flag on the new primary is now set to true.

If we come back to the idea of the GUC or the API, adding an allow_overwrite parameter to the pg_create_logical_replication_slot function and dropping the existing logical slot when it is set to true could be a suitable approach, along the lines of the sketch below.
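For illustration only, a minimal sketch of how such a call could look; the allow_overwrite argument, its position, and its default are assumptions here, the current function only takes slot_name, plugin, temporary, twophase and failover:

    -- hypothetical usage: allow_overwrite is a proposed parameter, it does not exist today
    SELECT pg_create_logical_replication_slot(
               'logical_slot',   -- slot_name
               'pgoutput',       -- plugin
               false,            -- temporary
               false,            -- twophase
               true,             -- failover
               true);            -- allow_overwrite (proposed): drop a pre-existing
                                 -- slot with the same name before creating this one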
What is your opinion?

Regards,
Fabrice

On Fri, Aug 8, 2025 at 7:39 AM shveta malik <shveta.ma...@gmail.com> wrote:

> On Thu, Aug 7, 2025 at 6:50 PM Fabrice Chapuis <fabrice636...@gmail.com> wrote:
> >
> > Hi,
> >
> > An issue occurred during the initial switchover using PostgreSQL version 17.5.
> > The setup consists of a cluster with two nodes, managed by Patroni version 4.0.5.
> > Logical replication is configured on the same instance, and the new feature
> > enabling logical replication slots to be failover-safe in a highly available
> > environment is used. Logical slot management is currently disabled in Patroni.
> >
> > Following are some screen captures taken during the switchover.
> >
> > 1. Run the switchover with Patroni
> >
> > patronictl switchover
> >
> > Current cluster topology
> >
> > + Cluster: ClusterX (7529893278186104053) ----+----+-----------+
> > | Member   | Host         | Role    | State     | TL | Lag in MB |
> > +----------+--------------+---------+-----------+----+-----------+
> > | node_1   | xxxxxxxxxxxx | Leader  | running   |  4 |           |
> > | node_2   | xxxxxxxxxxxx | Replica | streaming |  4 |         0 |
> > +----------+--------------+---------+-----------+----+-----------+
> >
> > 2. Check the slot on the new Primary
> >
> > select * from pg_replication_slots where slot_type = 'logical';
> > +-[ RECORD 1 ]--------+----------------+
> > | slot_name           | logical_slot   |
> > | plugin              | pgoutput       |
> > | slot_type           | logical        |
> > | datoid              | 25605          |
> > | database            | db_test        |
> > | temporary           | f              |
> > | active              | t              |
> > | active_pid          | 3841546        |
> > | xmin                |                |
> > | catalog_xmin        | 10399          |
> > | restart_lsn         | 0/37002410     |
> > | confirmed_flush_lsn | 0/37002448     |
> > | wal_status          | reserved       |
> > | safe_wal_size       |                |
> > | two_phase           | f              |
> > | inactive_since      |                |
> > | conflicting         | f              |
> > | invalidation_reason |                |
> > | failover            | t              |
> > | synced              | t              |
> > +---------------------+----------------+
> >
> > Logical replication is active again after the promote.
> >
> > 3. Check the slot on the new standby
> >
> > select * from pg_replication_slots where slot_type = 'logical';
> > +-[ RECORD 1 ]--------+-------------------------------+
> > | slot_name           | logical_slot                  |
> > | plugin              | pgoutput                      |
> > | slot_type           | logical                       |
> > | datoid              | 25605                         |
> > | database            | db_test                       |
> > | temporary           | f                             |
> > | active              | f                             |
> > | active_pid          |                               |
> > | xmin                |                               |
> > | catalog_xmin        | 10397                         |
> > | restart_lsn         | 0/3638F5F0                    |
> > | confirmed_flush_lsn | 0/3638F6A0                    |
> > | wal_status          | reserved                      |
> > | safe_wal_size       |                               |
> > | two_phase           | f                             |
> > | inactive_since      | 2025-08-05 10:21:03.342587+02 |
> > | conflicting         | f                             |
> > | invalidation_reason |                               |
> > | failover            | t                             |
> > | synced              | f                             |
> > +---------------------+-------------------------------+
> >
> > The synced flag keeps the value false.
> > The following error appears in the log:
> >
> > 2025-06-10 16:40:58.996 CEST [739829]: [1-1] user=,db=,client=,application= LOG: slot sync worker started
> > 2025-06-10 16:40:59.011 CEST [739829]: [2-1] user=,db=,client=,application= ERROR: exiting from slot synchronization because same name slot "logical_slot" already exists on the standby
> >
> > I would like to make a proposal to address the issue:
> > Since the logical slot is in a failover state on both the primary and the
> > standby, an attempt could be made to resynchronize them.
> > I modified the slotsync.c module:
> >
> > +++ b/src/backend/replication/logical/slotsync.c
> > @@ -649,24 +649,46 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
> >
> >          return false;
> >      }
> > -
> > -    /* Search for the named slot */
> > +    // Both local and remote slot have the same name
> >      if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> >      {
> >          bool    synced;
> > +        bool    failover_status = remote_slot->failover;
> >
> >          SpinLockAcquire(&slot->mutex);
> >          synced = slot->data.synced;
> >          SpinLockRelease(&slot->mutex);
> > +
> > +        if (!synced){
> > +
> > +            Assert(!MyReplicationSlot);
> > +
> > +            if (failover_status){
> > +
> > +                ReplicationSlotAcquire(remote_slot->name, true, true);
> > +
> > +                // Put the synced flag to true to attempt resynchronizing failover slot on the standby
> > +                MyReplicationSlot->data.synced = true;
> > +
> > +                ReplicationSlotMarkDirty();
> >
> > -        /* User-created slot with the same name exists, raise ERROR. */
> > -        if (!synced)
> > -            ereport(ERROR,
> > +                ReplicationSlotRelease();
> > +
> > +                /* Get rid of a replication slot that is no longer wanted */
> > +                ereport(WARNING,
> > +                        errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                        errmsg("slot \"%s\" local slot has the same name as remote slot and they are in failover mode, try to synchronize them",
> > +                               remote_slot->name));
> > +                return false;    /* Going back to the main loop after dropping the failover slot */
> > +            }
> > +            else
> > +                /* User-created slot with the same name exists, raise ERROR. */
> > +                ereport(ERROR,
> >                         errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                         errmsg("exiting from slot synchronization because same"
> > -                              " name slot \"%s\" already exists on the standby",
> > -                              remote_slot->name));
> > -
> > +                              " name slot \"%s\" already exists on the standby",
> > +                              remote_slot->name));
> > +        }
> >      /*
> >       * The slot has been synchronized before.
> >       *
> >
> > This message follows the discussions started in this thread:
> > https://www.postgresql.org/message-id/CAA5-nLDvnqGtBsKu4T_s-cS%2BdGbpSLEzRwgep1XfYzGhQ4o65A%40mail.gmail.com
> >
> > Help would be appreciated to move this point forward
> >
>
> Thank You for starting a new thread and working on it.
>
> I think you have given reference to the wrong thread, the correct one
> is [1].
>
> IIUC, the proposed fix is checking if remote_slot is failover-enabled
> and the slot with the same name exists locally but has 'synced'=false,
> enable the 'synced' flag and proceed with synchronization from next
> cycle onward, else error out. But remote-slot's failover will always
> be true if we have reached this stage. Did you actually mean to check
> if the local slot is failover-enabled but has 'synced' set to false
> (indicating it’s a new standby after a switchover)? Even with that
> check, it might not be the correct thing to overwrite such a slot
> internally. I think in [1], we discussed a couple of ideas related to
> a GUC, alter-api, drop_all_slots API, but I don't see any of those
> proposed here. Do we want to try any of that?
>
> [1]:
> https://www.postgresql.org/message-id/flat/CAA5-nLD0vKn6T1-OHROBNfN2Pxa17zVo4UoVBdfHn2y%3D7nKixA%40mail.gmail.com
>
> thanks
> Shveta
>