At Mon, 6 Apr 2020 14:58:39 -0400, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote in > > LOG: slot rep1 is invalidated at 0/1C00000 due to exceeding > > max_slot_wal_keep_size > > Sounds good. Here's a couple of further adjustments to your v24. This > passes the existing tests (pg_basebackup exception noted below), but I > don't have the updated 019_replslot_limit.pl, so that still needs to be > verified. > > First, cosmetic changes in xlog.c. > > Second, an unrelated bugfix: ReplicationSlotsComputeLogicalRestartLSN() > is able to return InvalidXLogRecPtr if there's a slot with invalid > restart_lsn. I'm fairly certain that that's bogus. I think this needs > to be backpatched.
Logical slots are not assumed to be in that state, tait is, in_use but having invalid restart_lsn. Maybe we need to define the behavor if restart_lsn is invalid (but confirmed_flush_lsn is valid)? > Third: The loop in InvalidateObsoleteReplicationSlots was reading > restart_lsn without aquiring mutex. Split the "continue" line in two, so > in_use is checked without spinlock and restart_lsn is checked with it. Right. Thanks. > This means we also need to store restart_lsn in a local variable before > logging the message (because we don't want to log with spinlock held). > Also, use ereport() not elog() for that, and add quotes to the slot > name. I omitted the quotes since slot names don't contain white spaces, but, yes, it is quoted in other places. elog is just my bad. > Lastly, I noticed that we're now changing the slot's restart_lsn to > Invalid without being the slot's owner, which goes counter to what is > said in slot.h: > > * - Individual fields are protected by mutex where only the backend owning > * the slot is authorized to update the fields from its own slot. The > * backend owning the slot does not need to take this lock when reading its > * own fields, while concurrent backends not owning this slot should take the > * lock when reading this slot's data. > > What this means is that if the slot owner walsender updates the > restart_lsn to a newer value just as we (checkpointer) are trying to set > it to Invalid, the owner's value might persist and our value would be > lost. Right. > AFAICT if we were really stressed about getting this exactly correct, > then we would have to kill the walsender, wait for it to die, then > ReplicationSlotAcquire and *then* update > MyReplicationSlot->data.restart_lsn. But I don't think we want to do > that during checkpoint, and I'm not sure we need to be as strict anyway: Agreed. > it seems to me that it suffices to check restart_lsn for being invalid > in the couple of places where the slot's owner advances (which is the > two auxiliary functions for ProcessStandbyReplyMessage). I have done so > in the attached. There are other places where the restart_lsn is set, > but those seem to be used only when the slot is created. I don't think > we need to cover for those, but I'm not 100% sure about that. StartLogicalReplcation does "XLogBeginRead(,MyReplicationSlot->data.restart_lsn)". If the restart_lsn is invalid, following call to XLogReadRecord runs into assertion failure. Walsender (or StartLogicalReplication) should correctly reject reconnection from the subscriber if restart_lsn is invalid. > However, the change in PhysicalConfirmReceivedLocation() breaks > the way slots work for pg_basebackup: apparently the slot is created > with a restart_lsn of Invalid and we only advance it the first time we > process a feedback message from pg_basebackup. I have a vague feeling > that that's bogus, but I'll have to look at the involved code a little > bit more closely to be sure about this. Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot? > One last thing: I think we need to ReplicationSlotMarkDirty() and > ReplicationSlotSave() after changing the LSN. My patch doesn't do that. Oops. > I noticed that the checkpoint already saved the slot once; maybe it > would make more sense to avoid doubly-writing the files by removing > CheckPointReplicationSlots() from CheckPointGuts, and instead call it > just after doing InvalidateObsoleteReplicationSlots(). But this is not > very important, since we don't expect to be modifying slots because of > disk-space reasons very frequently anyway. Agreed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center