Hi, On 2023-04-07 18:32:04 -0400, Melanie Plageman wrote: > Code review only of 0001-0005. > > I noticed you had two 0008, btw.
Yea, sorry for that. One was the older version. Just before sending the patch I saw another occurance of a test failure, which I then fixed. In the course of that I changed something in the patch subject. > On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote: > > Hi, > > > > On 2023-04-07 08:47:57 -0700, Andres Freund wrote: > > > Integrated all of these. > > > > From 0e038eb5dfddec500fbf4625775d1fa508a208f6 Mon Sep 17 00:00:00 2001 > > From: Andres Freund <and...@anarazel.de> > > Date: Thu, 6 Apr 2023 20:00:07 -0700 > > Subject: [PATCH va67 1/9] Replace a replication slot's invalidated_at LSN > > with > > an enum > > > > diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h > > index 8872c80cdfe..ebcb637baed 100644 > > --- a/src/include/replication/slot.h > > +++ b/src/include/replication/slot.h > > @@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency > > RS_TEMPORARY > > } ReplicationSlotPersistency; > > > > +/* > > + * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the > > + * 'invalidated' field is set to a value other than _NONE. > > + */ > > +typedef enum ReplicationSlotInvalidationCause > > +{ > > + RS_INVAL_NONE, > > + /* required WAL has been removed */ > > I just wonder if RS_INVAL_WAL is too generic. Something like > RS_INVAL_WAL_MISSING or similar may be better since it seems there are > other inavlidation causes that may be related to WAL. Renamed to RS_INVAL_WAL_REMOVED > > From 52c25cc15abc4470d19e305d245b9362e6b8d6a3 Mon Sep 17 00:00:00 2001 > > From: Andres Freund <and...@anarazel.de> > > Date: Fri, 7 Apr 2023 09:32:48 -0700 > > Subject: [PATCH va67 3/9] Support invalidating replication slots due to > > horizon and wal_level > > MIME-Version: 1.0 > > Content-Type: text/plain; charset=UTF-8 > > Content-Transfer-Encoding: 8bit > > > > Needed for supporting logical decoding on a standby. The new invalidation > > methods will be used in a subsequent commit. > > > > You probably are aware, but applying 0003 and 0004 both gives me two > warnings: > > warning: 1 line adds whitespace errors. > Warning: commit message did not conform to UTF-8. > You may want to amend it after fixing the message, or set the config > variable i18n.commitEncoding to the encoding your project uses. I did see the whitespace error, but not the encoding error. We have a bunch of other commit messages with Fabrizio's name "properly spelled" in, so I think that's ok. > > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c > > index df23b7ed31e..c2a9accebf6 100644 > > --- a/src/backend/replication/slot.c > > +++ b/src/backend/replication/slot.c > > @@ -1241,8 +1241,58 @@ ReplicationSlotReserveWal(void) > > } > > > > /* > > - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot > > - * and mark it invalid, if necessary and possible. > > + * Report that replication slot needs to be invalidated > > + */ > > +static void > > +ReportSlotInvalidation(ReplicationSlotInvalidationCause cause, > > + bool terminating, > > + int pid, > > + NameData slotname, > > + XLogRecPtr restart_lsn, > > + XLogRecPtr oldestLSN, > > + TransactionId > > snapshotConflictHorizon) > > +{ > > + StringInfoData err_detail; > > + bool hint = false; > > + > > + initStringInfo(&err_detail); > > + > > + switch (cause) > > + { > > + case RS_INVAL_WAL: > > + hint = true; > > + appendStringInfo(&err_detail, _("The slot's restart_lsn > > %X/%X exceeds the limit by %llu bytes."), > > + > > LSN_FORMAT_ARGS(restart_lsn), > > I'm not sure what the below cast is meant to do. If you are trying to > protect against overflow/underflow, I think you'd need to cast before > doing the subtraction. > > + (unsigned long long) > > (oldestLSN - restart_lsn)); > > + break; That's our current way of passing 64bit numbers to format string functions. It ends up as a 64bit number everywhere, even windows (with its stupid ILP32 model). > > + case RS_INVAL_HORIZON: > > + appendStringInfo(&err_detail, _("The slot conflicted > > with xid horizon %u."), > > + > > snapshotConflictHorizon); > > + break; > > + > > + case RS_INVAL_WAL_LEVEL: > > + appendStringInfo(&err_detail, _("Logical decoding on > > standby requires wal_level to be at least logical on the primary server")); > > + break; > > + case RS_INVAL_NONE: > > + pg_unreachable(); > > + } > > This ereport is quite hard to read. Is there any simplification you can > do of the ternaries without undue duplication? I tried a bunch, and none really seemed an improvement. > > @@ -1286,10 +1340,45 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, > > XLogRecPtr oldestLSN, > > restart_lsn = s->data.restart_lsn; > > > > /* > > - * If the slot is already invalid or is fresh enough, we don't > > need to > > - * do anything. > > + * If the slot is already invalid or is a non conflicting slot, > > we > > + * don't need to do anything. > > */ > > - if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= > > oldestLSN) > > + if (s->data.invalidated == RS_INVAL_NONE) > > + { > > + switch (cause) > > + { > > + case RS_INVAL_WAL: > > + if (s->data.restart_lsn != > > InvalidXLogRecPtr && > > + s->data.restart_lsn < oldestLSN) > > + conflict = cause; > > + break; > > Should the below be an error? a physical slot with RS_INVAL_HORIZON > invalidation cause? InvalidatePossiblyObsoleteSlot() gets called for all existing slots, so it's normal for RS_INVAL_HORIZON to encounter a physical slot. > > + case RS_INVAL_HORIZON: > > + if (!SlotIsLogical(s)) > > + break; > > + /* invalid DB oid signals a shared > > relation */ > > + if (dboid != InvalidOid && dboid != > > s->data.database) > > + break; > > + if > > (TransactionIdIsValid(s->effective_xmin) && > > + > > TransactionIdPrecedesOrEquals(s->effective_xmin, > > + > > snapshotConflictHorizon)) > > + conflict = cause; > > + else if > > (TransactionIdIsValid(s->effective_catalog_xmin) && > > + > > TransactionIdPrecedesOrEquals(s->effective_catalog_xmin, > > + > > snapshotConflictHorizon)) > > + conflict = cause; > > + break; > > + case RS_INVAL_WAL_LEVEL: > > + if (SlotIsLogical(s)) > > + conflict = cause; > > + break; > > All three of default, pg_unreachable(), and break seems a bit like > overkill. Perhaps remove the break? I always get nervous about case statements without a break, due to the fallthrough behaviour of switch/case. So I add it very habitually. I replaced it with case RS_INVAL_NONE: pg_unreachable(); that way we get warnings about unhandled cases too. Not sure why I hadn't done that. > > @@ -1390,14 +1476,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, > > XLogRecPtr oldestLSN, > > ReplicationSlotMarkDirty(); > > ReplicationSlotSave(); > > ReplicationSlotRelease(); > > + pgstat_drop_replslot(s); > > > > - ereport(LOG, > > - errmsg("invalidating obsolete > > replication slot \"%s\"", > > - NameStr(slotname)), > > - errdetail("The slot's restart_lsn %X/%X > > exceeds the limit by %llu bytes.", > > - > > LSN_FORMAT_ARGS(restart_lsn), > > - (unsigned long long) > > (oldestLSN - restart_lsn)), > > - errhint("You might need to increase > > max_slot_wal_keep_size.")); > > + ReportSlotInvalidation(conflict, false, active_pid, > > + slotname, > > restart_lsn, > > + oldestLSN, > > snapshotConflictHorizon); > > > > /* done with this slot for now */ > > break; > > @@ -1410,19 +1493,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, > > XLogRecPtr oldestLSN, > > } > > > > /* > > - * Mark any slot that points to an LSN older than the given segment > > - * as invalid; it requires WAL that's about to be removed. > > + * Invalidate slots that require resources about to be removed. > > * > > * Returns true when any slot have got invalidated. > > * > > + * Whether a slot needs to be invalidated depends on the cause. A slot is > > + * removed if it: > > + * - RS_INVAL_WAL: requires a LSN older than the given segment > > + * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon, in the > > given db > > + dboid may be InvalidOid for shared relations > > the comma above reduces readability > > is this what you mean? > > RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given > db; dboid may be InvalidOid for shared relations Yep. > > @@ -1443,7 +1443,13 @@ > > InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, > > > > slotname, restart_lsn, > > > > oldestLSN, snapshotConflictHorizon); > > > > - (void) kill(active_pid, SIGTERM); > > + if (MyBackendType == B_STARTUP) > > Is SendProcSignal() marked warn_unused_result or something? I don't see > other callers who don't use its return value void casting it. I went back and forth about it. I think Bertrand added. It looks a bit odd to have it, for the reason you say. It also looks a bit odd to not have, given the parallel (void) kill(). > > diff --git a/src/backend/storage/ipc/standby.c > > b/src/backend/storage/ipc/standby.c > > index 9f56b4e95cf..3b5d654347e 100644 > > --- a/src/backend/storage/ipc/standby.c > > +++ b/src/backend/storage/ipc/standby.c > > @@ -24,6 +24,7 @@ > > #include "access/xlogutils.h" > > #include "miscadmin.h" > > #include "pgstat.h" > > +#include "replication/slot.h" > > #include "storage/bufmgr.h" > > #include "storage/lmgr.h" > > #include "storage/proc.h" > > @@ -466,6 +467,7 @@ > > ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, > > */ > > void > > ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon, > > + bool > > isCatalogRel, > > > > RelFileLocator locator) > > { > > VirtualTransactionId *backends; > > @@ -491,6 +493,16 @@ ResolveRecoveryConflictWithSnapshot(TransactionId > > snapshotConflictHorizon, > > > > PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, > > > > WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT, > > > > true); > > + > > + /* > > + * Note that WaitExceedsMaxStandbyDelay() is not taken into account here > > + * (as opposed to ResolveRecoveryConflictWithVirtualXIDs() above). That > > + * seems OK, given that this kind of conflict should not normally be > > do you mean "when using a physical replication slot"? > > + * reached, e.g. by using a physical replication slot. > > + */ > > + if (wal_level >= WAL_LEVEL_LOGICAL && isCatalogRel) > > + InvalidateObsoleteReplicationSlots(RS_INVAL_HORIZON, 0, > > locator.dbOid, > > + > > snapshotConflictHorizon); > > } No. I mean that normally a physical replication slot, or some other approach, should prevent such conflicts. I replaced 'by' with 'due to' Thanks a lot for the review! Greetings, Andres Freund