Hi,

On 2023-04-07 18:32:04 -0400, Melanie Plageman wrote:
> Code review only of 0001-0005.
> 
> I noticed you had two 0008, btw.

Yea, sorry for that. One was the older version. Just before sending the patch
I saw another occurance of a test failure, which I then fixed. In the course
of that I changed something in the patch subject.


> On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> > > Integrated all of these.
> > 
> > From 0e038eb5dfddec500fbf4625775d1fa508a208f6 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <and...@anarazel.de>
> > Date: Thu, 6 Apr 2023 20:00:07 -0700
> > Subject: [PATCH va67 1/9] Replace a replication slot's invalidated_at LSN 
> > with
> >  an enum
> > 
> > diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
> > index 8872c80cdfe..ebcb637baed 100644
> > --- a/src/include/replication/slot.h
> > +++ b/src/include/replication/slot.h
> > @@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency
> >     RS_TEMPORARY
> >  } ReplicationSlotPersistency;
> >  
> > +/*
> > + * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
> > + * 'invalidated' field is set to a value other than _NONE.
> > + */
> > +typedef enum ReplicationSlotInvalidationCause
> > +{
> > +   RS_INVAL_NONE,
> > +   /* required WAL has been removed */
> 
> I just wonder if RS_INVAL_WAL is too generic. Something like
> RS_INVAL_WAL_MISSING or similar may be better since it seems there are
> other inavlidation causes that may be related to WAL.

Renamed to RS_INVAL_WAL_REMOVED



> > From 52c25cc15abc4470d19e305d245b9362e6b8d6a3 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <and...@anarazel.de>
> > Date: Fri, 7 Apr 2023 09:32:48 -0700
> > Subject: [PATCH va67 3/9] Support invalidating replication slots due to
> >  horizon and wal_level
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=UTF-8
> > Content-Transfer-Encoding: 8bit
> > 
> > Needed for supporting logical decoding on a standby. The new invalidation
> > methods will be used in a subsequent commit.
> > 
> 
> You probably are aware, but applying 0003 and 0004 both gives me two
> warnings:
> 
> warning: 1 line adds whitespace errors.
> Warning: commit message did not conform to UTF-8.
> You may want to amend it after fixing the message, or set the config
> variable i18n.commitEncoding to the encoding your project uses.

I did see the whitespace error, but not the encoding error. We have a bunch of
other commit messages with Fabrizio's name "properly spelled" in, so I think
that's ok.



> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index df23b7ed31e..c2a9accebf6 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1241,8 +1241,58 @@ ReplicationSlotReserveWal(void)
> >  }
> >  
> >  /*
> > - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
> > - * and mark it invalid, if necessary and possible.
> > + * Report that replication slot needs to be invalidated
> > + */
> > +static void
> > +ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> > +                                      bool terminating,
> > +                                      int pid,
> > +                                      NameData slotname,
> > +                                      XLogRecPtr restart_lsn,
> > +                                      XLogRecPtr oldestLSN,
> > +                                      TransactionId 
> > snapshotConflictHorizon)
> > +{
> > +   StringInfoData err_detail;
> > +   bool            hint = false;
> > +
> > +   initStringInfo(&err_detail);
> > +
> > +   switch (cause)
> > +   {
> > +           case RS_INVAL_WAL:
> > +                   hint = true;
> > +                   appendStringInfo(&err_detail, _("The slot's restart_lsn 
> > %X/%X exceeds the limit by %llu bytes."),
> > +                                                    
> > LSN_FORMAT_ARGS(restart_lsn),
> 
> I'm not sure what the below cast is meant to do. If you are trying to
> protect against overflow/underflow, I think you'd need to cast before
> doing the subtraction.


> > +                                                    (unsigned long long) 
> > (oldestLSN - restart_lsn));
> > +                   break;

That's our current way of passing 64bit numbers to format string
functions. It ends up as a 64bit number everywhere, even windows (with its
stupid ILP32 model).



> > +           case RS_INVAL_HORIZON:
> > +                   appendStringInfo(&err_detail, _("The slot conflicted 
> > with xid horizon %u."),
> > +                                                    
> > snapshotConflictHorizon);
> > +                   break;
> > +
> > +           case RS_INVAL_WAL_LEVEL:
> > +                   appendStringInfo(&err_detail, _("Logical decoding on 
> > standby requires wal_level to be at least logical on the primary server"));
> > +                   break;
> > +           case RS_INVAL_NONE:
> > +                   pg_unreachable();
> > +   }
> 
> This ereport is quite hard to read. Is there any simplification you can
> do of the ternaries without undue duplication?

I tried a bunch, and none really seemed an improvement.


> > @@ -1286,10 +1340,45 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, 
> > XLogRecPtr oldestLSN,
> >             restart_lsn = s->data.restart_lsn;
> >  
> >             /*
> > -            * If the slot is already invalid or is fresh enough, we don't 
> > need to
> > -            * do anything.
> > +            * If the slot is already invalid or is a non conflicting slot, 
> > we
> > +            * don't need to do anything.
> >              */
> > -           if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= 
> > oldestLSN)
> > +           if (s->data.invalidated == RS_INVAL_NONE)
> > +           {
> > +                   switch (cause)
> > +                   {
> > +                           case RS_INVAL_WAL:
> > +                                   if (s->data.restart_lsn != 
> > InvalidXLogRecPtr &&
> > +                                           s->data.restart_lsn < oldestLSN)
> > +                                           conflict = cause;
> > +                                   break;
> 
> Should the below be an error? a physical slot with RS_INVAL_HORIZON
> invalidation cause?

InvalidatePossiblyObsoleteSlot() gets called for all existing slots, so it's
normal for RS_INVAL_HORIZON to encounter a physical slot.


> > +                           case RS_INVAL_HORIZON:
> > +                                   if (!SlotIsLogical(s))
> > +                                           break;
> > +                                   /* invalid DB oid signals a shared 
> > relation */
> > +                                   if (dboid != InvalidOid && dboid != 
> > s->data.database)
> > +                                           break;
> > +                                   if 
> > (TransactionIdIsValid(s->effective_xmin) &&
> > +                                           
> > TransactionIdPrecedesOrEquals(s->effective_xmin,
> > +                                                                           
> >                           snapshotConflictHorizon))
> > +                                           conflict = cause;
> > +                                   else if 
> > (TransactionIdIsValid(s->effective_catalog_xmin) &&
> > +                                                    
> > TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
> > +                                                                           
> >                                    snapshotConflictHorizon))
> > +                                           conflict = cause;
> > +                                   break;
> > +                           case RS_INVAL_WAL_LEVEL:
> > +                                   if (SlotIsLogical(s))
> > +                                           conflict = cause;
> > +                                   break;
> 
> All three of default, pg_unreachable(), and break seems a bit like
> overkill. Perhaps remove the break?

I always get nervous about case statements without a break, due to the
fallthrough behaviour of switch/case. So I add it very habitually. I replaced
it with
                                case RS_INVAL_NONE:
                                        pg_unreachable();
that way we get warnings about unhandled cases too. Not sure why I hadn't done
that.

> > @@ -1390,14 +1476,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, 
> > XLogRecPtr oldestLSN,
> >                     ReplicationSlotMarkDirty();
> >                     ReplicationSlotSave();
> >                     ReplicationSlotRelease();
> > +                   pgstat_drop_replslot(s);
> >  
> > -                   ereport(LOG,
> > -                                   errmsg("invalidating obsolete 
> > replication slot \"%s\"",
> > -                                              NameStr(slotname)),
> > -                                   errdetail("The slot's restart_lsn %X/%X 
> > exceeds the limit by %llu bytes.",
> > -                                                     
> > LSN_FORMAT_ARGS(restart_lsn),
> > -                                                     (unsigned long long) 
> > (oldestLSN - restart_lsn)),
> > -                                   errhint("You might need to increase 
> > max_slot_wal_keep_size."));
> > +                   ReportSlotInvalidation(conflict, false, active_pid,
> > +                                                              slotname, 
> > restart_lsn,
> > +                                                              oldestLSN, 
> > snapshotConflictHorizon);
> >  
> >                     /* done with this slot for now */
> >                     break;
> > @@ -1410,19 +1493,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, 
> > XLogRecPtr oldestLSN,
> >  }
> >  
> >  /*
> > - * Mark any slot that points to an LSN older than the given segment
> > - * as invalid; it requires WAL that's about to be removed.
> > + * Invalidate slots that require resources about to be removed.
> >   *
> >   * Returns true when any slot have got invalidated.
> >   *
> > + * Whether a slot needs to be invalidated depends on the cause. A slot is
> > + * removed if it:
> > + * - RS_INVAL_WAL: requires a LSN older than the given segment
> > + * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon, in the 
> > given db
> > +     dboid may be InvalidOid for shared relations
> 
> the comma above reduces readability
> 
> is this what you mean?
> 
> RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given
> db; dboid may be InvalidOid for shared relations

Yep.


> > @@ -1443,7 +1443,13 @@ 
> > InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
> >                                                                        
> > slotname, restart_lsn,
> >                                                                        
> > oldestLSN, snapshotConflictHorizon);
> >  
> > -                           (void) kill(active_pid, SIGTERM);
> > +                           if (MyBackendType == B_STARTUP)
> 
> Is SendProcSignal() marked warn_unused_result or something? I don't see
> other callers who don't use its return value void casting it.

I went back and forth about it. I think Bertrand added. It looks a bit odd to
have it, for the reason you say. It also looks a bit odd to not have, given
the parallel (void) kill().


> > diff --git a/src/backend/storage/ipc/standby.c 
> > b/src/backend/storage/ipc/standby.c
> > index 9f56b4e95cf..3b5d654347e 100644
> > --- a/src/backend/storage/ipc/standby.c
> > +++ b/src/backend/storage/ipc/standby.c
> > @@ -24,6 +24,7 @@
> >  #include "access/xlogutils.h"
> >  #include "miscadmin.h"
> >  #include "pgstat.h"
> > +#include "replication/slot.h"
> >  #include "storage/bufmgr.h"
> >  #include "storage/lmgr.h"
> >  #include "storage/proc.h"
> > @@ -466,6 +467,7 @@ 
> > ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
> >   */
> >  void
> >  ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> > +                                                                   bool 
> > isCatalogRel,
> >                                                                     
> > RelFileLocator locator)
> >  {
> >     VirtualTransactionId *backends;
> > @@ -491,6 +493,16 @@ ResolveRecoveryConflictWithSnapshot(TransactionId 
> > snapshotConflictHorizon,
> >                                                                             
> >    PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
> >                                                                             
> >    WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
> >                                                                             
> >    true);
> > +
> > +   /*
> > +    * Note that WaitExceedsMaxStandbyDelay() is not taken into account here
> > +    * (as opposed to ResolveRecoveryConflictWithVirtualXIDs() above). That
> > +    * seems OK, given that this kind of conflict should not normally be
> 
> do you mean "when using a physical replication slot"?

> > +    * reached, e.g. by using a physical replication slot.
> > +    */
> > +   if (wal_level >= WAL_LEVEL_LOGICAL && isCatalogRel)
> > +           InvalidateObsoleteReplicationSlots(RS_INVAL_HORIZON, 0, 
> > locator.dbOid,
> > +                                                                           
> >    snapshotConflictHorizon);
> >  }

No. I mean that normally a physical replication slot, or some other approach,
should prevent such conflicts. I replaced 'by' with 'due to'


Thanks a lot for the review!

Greetings,

Andres Freund


Reply via email to