Re: confusing results from pg_get_replication_slots()

Robert Haas Sat, 03 Jan 2026 05:12:46 -0800

On Sat, Jan 3, 2026 at 7:22 AM Andrey Borodin <[email protected]> wrote:
> I concur that showing "unreserved" when there is no actual WAL is a bug.
> Proposed fix will work and is very succinct. Resulting code structure is not 
> super elegant, but acceptable.

Agreed.

> I don't fully understand circumstances when this bug can do any harm. Maybe 
> negative safe_wal_size could be a surprise for some monitoring tools.

Yes, the fact that safe_wal_size can go negative is one of the things
that makes me think this outcome was not really intended.

> I don't understand a reason to disallow reviving a slot. Ofc with some new 
> LSN that is currently available in pg_wal.
>
> Imagine a following scenario: in a cluster of a Primary and a Standby a long 
> analytical query is causing huge lag, primary removes some WAL segments due 
> to max_slot_wal_keep_size, standby is disconnected, consumes several WALs 
> from archive, catches up and continues. Or, if something was vacuumed, 
> cancels analytical query. If we disallow reconnection of this standby, it will 
> stay in archive recovery. I don't see how it's a good thing.

I think for physical slots invalidation is a little bit of an odd
concept -- why do we ever invalidate a physical slot at all, rather
than just stop reserving WAL at some point and let what happens,
happen? But the reality is that the slot cannot be resurrected once
invalidated; you have to drop and recreate it. Possibly we should
revisit that decision or document the logic more clearly, but that's
not something to think of back-patching.

> > On 3 Jan 2026, at 02:10, Robert Haas <[email protected]> wrote:
> >
> > Maybe we shouldn't display "lost" when the slot
> > is invalidated but "invalidated", for example, and any other value
> > means we're just returning whatever GetWALAvailability() told us.
> > Also, maybe the exception for connected slots should just be removed, on
> > the assumption that the race condition isn't common enough to matter,
> > or maybe that logic should be pushed down into GetWALAvailability() if
> > we want to keep it.
>
> I don't think following logic works: "someone seems to be connected to this 
> slot, perhaps it's still not lost". This is an error-prone heuristic that is 
> trying to work around a possibly stale restart_lsn.
> For HEAD I'd propose to actually read restart_lsn, and determine if walsender 
> will issue "requested WAL segment has already been removed" on next attempt 
> to send something. In this case slot is "lost".
>
> If I understand correctly, slot might be "invalidated", but not "lost" in 
> this sense yet: timeout occurred, but WAL is still there.

What I think is *really bad* about this situation is that, when the
slot is invalidated, showing it as unreserved makes it still look
potentially useful. But no matter whether the WAL is present or not,
the slot serves neither to reserve WAL nor to hold back xmin once
invalidated. Therefore it is not useful. The user would be better off
using no slot at all, in which case xmin would be held back and WAL
reserved at least while the walreceiver is connected. It is not a
question of whether the user can stream from the slot: the user
doesn't need a slot to stream. It's a question of whether the user
erroneously believes themselves to be protected against something when
in fact they are using a defunct slot that is worse than no slot at
all.
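
To make the point above concrete, here is a minimal sketch of the
reporting rule being argued for. It is not the actual PostgreSQL code or
the proposed patch: the enum and function names are hypothetical, and
only the four status strings (reserved, extended, unreserved, lost) are
the ones pg_replication_slots reports today. The one assumption it
encodes is that invalidation is checked first, so that a defunct slot
can never be displayed as "unreserved" even if its WAL is still on disk.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the availability states that a function
 * such as GetWALAvailability() reports. Not PostgreSQL's real enum.
 */
typedef enum
{
    SKETCH_WAL_RESERVED,
    SKETCH_WAL_EXTENDED,
    SKETCH_WAL_UNRESERVED,
    SKETCH_WAL_REMOVED
} SketchWalAvail;

/*
 * Decide what to show in wal_status. An invalidated slot neither
 * reserves WAL nor holds back xmin, so it is reported as "lost"
 * regardless of whether the WAL happens to still be present.
 */
const char *
sketch_wal_status(bool slot_invalidated, SketchWalAvail avail)
{
    if (slot_invalidated)
        return "lost";

    switch (avail)
    {
        case SKETCH_WAL_RESERVED:
            return "reserved";
        case SKETCH_WAL_EXTENDED:
            return "extended";
        case SKETCH_WAL_UNRESERVED:
            return "unreserved";
        case SKETCH_WAL_REMOVED:
            return "lost";
    }

    /* Not reached for valid enum values; keeps compilers quiet. */
    return "unknown";
}
```

Whether the invalidated case should print "lost" or a separate value
such as "invalidated" is the open question raised earlier in the
thread; the sketch only shows that the check on invalidation comes
before anything derived from WAL availability.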

-- 
Robert Haas
EDB: http://www.enterprisedb.com
