On Fri, May 23, 2025 at 12:55 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> The remote_slot (slot on primary) should be advanced before you invoke
> sync_slot. Can you do pg_logical_slot_get_changes() API before performing
> sync? You can check the xmin of the logical slot after get_changes to ensure
> that xmin has moved to 765 in your case.

I'm fairly dismayed by this example. I hope I'm misunderstanding
something, because otherwise I have difficulty understanding how we
thought it was OK to ship this feature in this condition.

At the moment that pg_sync_replication_slots() is executed, a slot
named failover_slot exists on only one of the two servers. How can you
justify emitting an error message complaining that "remote slot
precedes local slot"? There's only one slot! I understand that, under
the hood, we probably created an additional slot on the standby and
then tried to fast-forward it, and this error occurred in the second
step. But a user shouldn't have to understand those kinds of internal
implementation details to make sense of the error message. If the
problem is that we're not able to create a slot on the standby at an
old enough LSN or XID position to permit its use with the
corresponding slot on the master, it should be reported that way.

It also seems like having to execute a manual step like
pg_logical_slot_get_changes() in order for things to work is really
missing the point of the feature. I mean, it seems like the intention
of the feature was that someone can just periodically call
pg_sync_replication_slots() on each standby and the right things will
happen -- creating slots or fast-forwarding them or dropping them, as
required. But if that sometimes requires manual fiddling like having
to consume changes from a slot then basically the feature just doesn't
work, because now the user will have to somehow understand when that
is required and what they need to do to fix it. This doesn't even seem
like a particularly obscure case.

To be honest, even after spending quite a bit of time on this, I still
don't really understand what's happening with the xmins here. Just
after creating the logical slot on the primary, it has xmin 764 on one
slot and xmin 765 on the other, and I don't understand why that's the
case, nor why the extra DDL command is needed to trigger the problem.
But I also can't shake the feeling that I shouldn't *need* to
understand that stuff to use the feature. Isn't that the whole point?

--
Robert Haas
EDB: http://www.enterprisedb.com