On Sat, Jan 3, 2026 at 1:22 PM Andrey Borodin <[email protected]> wrote:
>
> Hi Robert!

Hi Robert, Andrey,

> I don't understand a reason to disallow reviving a slot. Ofc with some new 
> LSN that is currently available in pg_wal.
>
> Imagine a following scenario: in a cluster of a Primary and a Standby a long 
> analytical query is causing huge lag, primary removes some WAL segments due 
> to max_slot_wal_keep_size, standby is disconnected, consumes several WALs 
> from archive, catches up and continues. Or, if something was vacuumed, 
> cancels analytical query. If we disallow reconnection of this stanby, it will 
> stay in archive recovery. I don't see how it's a good thing.

The key problem here (as I understand it) is that the STABLE branches
can silently disable hot_standby_feedback and cause hard-to-explain
query cancellations (of type confl_snapshot). So to frame this $thread
properly: for some people, query offloading to a standby that relies on
hot_standby_feedback is critical functionality, and it is important
that they know when it stops working (rather than only finding out
after getting conflicts).

So the behaviour of e.g. v16 is confusing when max_slot_wal_keep_size
is in use (which activates slot invalidation), replication lag is
present, and restore_command is in play (it makes reproduction easier,
though I don't think it is strictly necessary), and all of a sudden
confl_snapshot issues may arise. Add to that the fact that
pg_replication_slots reports wal_status='unreserved' (instead of
"lost") while restart_lsn keeps progressing, and it becomes even harder
to understand what's going on. Robert's fix simply makes this easier to
spot, but does not fix or prevent the core issue itself.
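
To illustrate what I mean by "harder to understand": these are the
kinds of queries I was looking at while reproducing this (just a
sketch; the slot name 'slot1' comes from my test setup):

    -- on the primary: slot state; before the patch this can show
    -- wal_status = 'unreserved' while restart_lsn keeps moving forward
    SELECT slot_name, active, restart_lsn, wal_status
    FROM pg_replication_slots
    WHERE slot_name = 'slot1';

    -- on the standby: snapshot-conflict cancellations per database
    -- (the confl_snapshot counter this thread is about)
    SELECT datname, confl_snapshot
    FROM pg_stat_database_conflicts;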

I'll start from the end: from my tests it looks like v19/master behaves
more sanely today and does kill such replication connections (it marks
the slot as "lost" and *prevents* further connections to it), so the
whole query-cancellation scenario simply cannot arise there.

primary:
    2026-01-05 11:31:11.447 CET [40926] LOG:  checkpoint starting: wal
    2026-01-05 11:31:11.457 CET [40926] LOG:  terminating process
41272 to release replication slot "slot1"
    2026-01-05 11:31:11.457 CET [40926] DETAIL:  The slot's
restart_lsn 0/8B000000 exceeds the limit by 16777216 bytes.
    2026-01-05 11:31:11.457 CET [40926] HINT:  You might need to
increase "max_slot_wal_keep_size".
    2026-01-05 11:31:11.457 CET [41272] FATAL:  terminating connection
due to administrator command
    2026-01-05 11:31:11.457 CET [41272] STATEMENT:  START_REPLICATION
SLOT "slot1" 0/06000000 TIMELINE 1
    2026-01-05 11:31:11.460 CET [41272] LOG:  released physical
replication slot "slot1"
    2026-01-05 11:31:11.462 CET [40926] LOG:  invalidating obsolete
replication slot "slot1"
    2026-01-05 11:31:11.462 CET [40926] DETAIL:  The slot's
restart_lsn 0/8B000000 exceeds the limit by 16777216 bytes.
    2026-01-05 11:31:11.462 CET [40926] HINT:  You might need to
increase "max_slot_wal_keep_size".

Even with archiving enabled, the standby won't ever be able to
reconnect unless the slot is recreated (the standby will continue
recovery using restore_command, but no lag will be reported either, as
pg_stat_replication will be empty since there is of course no
replication connection). This is a clear message, one knows how to deal
with it operationally, and it doesn't cause any mysterious conflicts
out of the blue (which is good). To Andrey's point, the v19 change
could be viewed as a feature regression in the situation where broken
streaming replication is being papered over by restore_command (v19
simply keeps throwing the above until the slot is recreated, while the
older versions would silently re-connect using a replication connection
and switch to the walreceiver path [instead of restore_command]; but
that silently disarms hot_standby_feedback -- creating the false
impression that it works while in reality it does not, which IMVHO is a
bigger problem than the one this $patch addresses).
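
As a side note on spotting the "silently disarmed" part: the best I can
offer is a sketch of what I'd check on the primary to see whether the
feedback is actually being honoured ('slot1' is from my setup; I have
not re-verified the exact values in every branch):

    -- xmin is the oldest transaction the slot asks the primary to
    -- retain; with hot_standby_feedback working it should track the
    -- standby's horizon
    SELECT slot_name, xmin, catalog_xmin
    FROM pg_replication_slots
    WHERE slot_name = 'slot1';

    -- backend_xmin is the standby's xmin horizon as reported by
    -- hot_standby_feedback
    SELECT application_name, backend_xmin
    FROM pg_stat_replication;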

I haven't tested this one, but Amit's
f41d8468ddea34170fe19fdc17b5a247e7d3ac78 is in REL_18_STABLE, so
ReplicationSlotAcquire() has behaved that way for quite some time
already.

So now, with e.g. v16:
- it allows reconnecting to "lost" slots
- this silently disarms hot_standby_feedback and causes hard-to-explain
query cancellations
- it shows such re-connected slots as "unreserved" without the patch
(which is bizarre and makes it even harder to diagnose), so +1 to
making it "lost" instead. That makes it a little more visible (but it
certainly doesn't make it fully obvious that hot_standby_feedback may
be getting disarmed). BTW I've tested the patch and under exactly the
same conditions it now reports "lost" correctly.
- however, even with that, it is pretty unclear how and when people are
supposed to arrive at the conclusion that they should recreate their
slots and that this is what might be causing the query conflicts. The
word "slot" is not mentioned even once in "26.4.2. Handling Query
Conflicts" [1], so perhaps we should just update the docs for < v19 to
say that this is a known issue (solved in v19), that it is quite rare,
but that if a "lost" slot is actively being used one should simply
disconnect the standby and recreate the slot to avoid those
confl_snapshot cancellations (a minimal sketch of those steps follows
below).
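
Something along these lines is what I have in mind for operators (just
a sketch; 'slot1' is my slot name and the standby is assumed to point
at it via primary_slot_name):

    -- on the primary: drop the invalidated slot (the standby must not
    -- be actively using it) and create a fresh one that reserves WAL
    -- from the current position
    SELECT pg_drop_replication_slot('slot1');
    SELECT pg_create_physical_replication_slot('slot1', true);

    -- the standby keeps primary_slot_name = 'slot1'; it catches up via
    -- restore_command as far as needed and then resumes streaming, with
    -- hot_standby_feedback effective again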

v16's primary log:
    2026-01-05 13:35:12.560 CET [87976] LOG:  checkpoint starting: wal
    2026-01-05 13:35:20.174 CET [87976] LOG:  terminating process
88033 to release replication slot "slot1"
    2026-01-05 13:35:20.174 CET [87976] DETAIL:  The slot's
restart_lsn 0/A2020F50 exceeds the limit by 16642224 bytes.
    2026-01-05 13:35:20.174 CET [87976] HINT:  You might need to
increase max_slot_wal_keep_size.
    2026-01-05 13:35:20.174 CET [88033] FATAL:  terminating connection
due to administrator command
    2026-01-05 13:35:20.174 CET [88033] STATEMENT:  START_REPLICATION
SLOT "slot1" 0/3000000 TIMELINE 1
    2026-01-05 13:35:51.281 CET [87976] LOG:  invalidating obsolete
replication slot "slot1"
    2026-01-05 13:35:51.281 CET [87976] DETAIL:  The slot's
restart_lsn 0/A2020F50 exceeds the limit by 16642224 bytes.
    2026-01-05 13:35:51.281 CET [87976] HINT:  You might need to
increase max_slot_wal_keep_size.
    [..]
    2026-01-05 13:35:51.407 CET [88659] LOG:  received replication
command: IDENTIFY_SYSTEM
    2026-01-05 13:35:51.407 CET [88659] STATEMENT:  IDENTIFY_SYSTEM
    2026-01-05 13:35:51.407 CET [88659] LOG:  received replication
command: START_REPLICATION SLOT "slot1" 0/A6000000 TIMELINE 1
    2026-01-05 13:35:51.407 CET [88659] STATEMENT:  START_REPLICATION
SLOT "slot1" 0/A6000000 TIMELINE 1

(clearly a lack of the ReplicationSlotSetInactiveSince() handling from
master).

Then the concern/scenario raised by Andrey -- allowing, in v19 (or in
the future), the switch from restore_command back to WAL streaming
across an invalidated slot -- sounds more like a new enhancement for
the (far) future, but this time while keeping backend xmin propagation
working.

-J.

[1] - 
https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT

