I noticed some other things: 1. KeepLogSeg sends a warning message when slots fall behind. To do this, it searches for "the most affected slot", that is, the slot that lost the most data. But it seems to me that that's a bit pointless; if a slot data, it's now useless and anything that was using that slot must be recreated. If you only know what's the most affected slot, it's not possible to see which *other* slots are affected. It doesn't matter if the slot missed one segment or twenty segments or 9999 segments -- the slot is now useless, or it is not useless. I think we should list the slot that was *least* affected, i.e., the slot that lost the minimum amount of segments; then the user knows that all slots that are older than that one are *also* affected.
2. KeepLogSeg ignores slots that are active. I guess the logic here is that if a slot is active, then it'll keep going until it catches up and we don't need to do anything about the used disk space. But that seems a false premise, because if a standby is so slow that it cannot keep up, it will eventually run the master out of diskspace even if it's active all the time. So I'm not seeing the reasoning that makes it useful to skip checking active slots. (BTW I don't think you need to keep that many static variables in that function. Just the slot name should be sufficient, I think ... or maybe even the *pointer* to the slot that was last reported. I think if a slot is behind and it lost segments, we should kill the walsender that's using it, and unreserve the segments. So maybe something like LWLockAcquire( ... ); for (i = 0 ; i < max_replication_slots; i++) { ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; XLogSegNo slotSegNo; XLByteToSeg(s->data.restart_lsn, slotSegNo, wal_segment_size); if (s->in_use) { if (s->active_pid) pids_to_kill = lappend(pids_to_kill, s->active_pid); nslots_affected++; ... ; /* other stuff */ } } LWLockRelease( ... ) /* release lock before syscalls */ foreach(l, pids_to_kill) { kill(SIGTERM, lfirst_int(l)); } I sense some attempt to salvage slots that are reading a segment that is "outdated" and removed, but for which the walsender has an open file descriptor. (This appears to be the "losing" state.) This seems dangerous, for example the segment might be recycled and is being overwritten with different data. Trying to keep track of that seems doomed. And even if the walsender can still read that data, it's only a matter of time before the next segment is also removed. So keeping the walsender alive is futile; it only delays the inevitable. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services