On Sat, Jan 3, 2026 at 7:22 AM Andrey Borodin <[email protected]> wrote:
> I concur that showing "unreserved" when there is no actual WAL is a bug.
> Proposed fix will work and is very succinct. Resulting code structure is not
> super elegant, but acceptable.

Agreed.

> I don't fully understand circumstances when this bug can do any harm. Maybe
> negative safe_wal_size could be a surprise for some monitoring tools.

Yes, the fact that safe_wal_size can go negative is one of the things
that makes me think this outcome was not really intended.

> I don't understand a reason to disallow reviving a slot. Ofc with some new
> LSN that is currently available in pg_wal.
>
> Imagine a following scenario: in a cluster of a Primary and a Standby a long
> analytical query is causing huge lag, primary removes some WAL segments due
> to max_slot_wal_keep_size, standby is disconnected, consumes several WALs
> from archive, catches up and continues. Or, if something was vacuumed,
> cancels analytical query. If we disallow reconnection of this standby, it will
> stay in archive recovery. I don't see how it's a good thing.

I think for physical slots invalidation is a little bit of an odd
concept -- why do we ever invalidate a physical slot at all, rather
than just stop reserving WAL at some point and let what happens,
happen? But the reality is that the slot cannot be resurrected once
invalidated; you have to drop and recreate it. Possibly we should
revisit that decision or document the logic more clearly, but that's
not something to consider back-patching.

> > On 3 Jan 2026, at 02:10, Robert Haas <[email protected]> wrote:
> >
> > Maybe we shouldn't display "lost" when the slot
> > is invalidated but "invalidated", for example, and any other value
> > means we're just returning whatever GetWALAvailability() told us.
> > Also, maybe the exception for connect slots should just be removed, on
> > the assumption that the race condition isn't common enough to matter,
> > or maybe that logic should be pushed down into GetWALAvailability() if
> > we want to keep it.
>
> I don't think following logic works: "someone seems to be connected to this
> slot, perhaps it's still not lost". This is error-prone heuristics that is
> trying to workaround possibly stale restart_lsn.
> For HEAD I'd propose to actually read restart_lsn, and determine if walsender
> will issue "requested WAL segment has already been removed" on next attempt
> to send something. In this case slot is "lost".
>
> If I understand correctly, slot might be "invalidated", but not "lost" in
> this sense yet: timeout occurred, but WAL is still there.

What I think is *really bad* about this situation is that, when the
slot is invalidated, showing it as unreserved makes it still look
potentially useful. But no matter whether the WAL is present or not,
the slot serves neither to reserve WAL nor to hold back xmin once
invalidated. Therefore it is not useful. The user would be better off
using no slot at all, in which case xmin would be held back and WAL
reserved at least while the walreceiver is connected.

It is not a question of whether the user can stream from the slot: the
user doesn't need a slot to stream. It's a question of whether the
user erroneously believes themselves to be protected against something
when in fact they are using a defunct slot that is worse than no slot
at all.

--
Robert Haas
EDB: http://www.enterprisedb.com
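
For illustration only, the display rule floated in the quoted text
above (report an invalidated slot as such, otherwise pass through what
GetWALAvailability() says, with no special case for connected slots)
could be sketched roughly as follows. This is not the proposed patch:
SketchWalAvail and sketch_wal_status() are hypothetical names, the enum
merely stands in for the result of GetWALAvailability(), and
"invalidated" is the example label suggested upthread, not an existing
wal_status value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the value GetWALAvailability() returns. */
typedef enum
{
    SKETCH_WAL_RESERVED,
    SKETCH_WAL_EXTENDED,
    SKETCH_WAL_UNRESERVED,
    SKETCH_WAL_REMOVED
} SketchWalAvail;

/*
 * Pick the string to display for a slot.  An invalidated slot is never
 * reported as "unreserved", whether or not its WAL is still present,
 * because it no longer reserves WAL or holds back xmin.
 */
const char *
sketch_wal_status(bool invalidated, SketchWalAvail avail)
{
    if (invalidated)
        return "invalidated";   /* assumed label, per the suggestion upthread */

    /* Otherwise just report what the availability check told us. */
    switch (avail)
    {
        case SKETCH_WAL_RESERVED:
            return "reserved";
        case SKETCH_WAL_EXTENDED:
            return "extended";
        case SKETCH_WAL_UNRESERVED:
            return "unreserved";
        case SKETCH_WAL_REMOVED:
            return "lost";
    }
    return "unknown";           /* not reached for valid enum values */
}
```

With this shape, the invalidation check comes first, so no availability
result can make a defunct slot look potentially useful.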
