Hi, here are some review comments for patch v43-0001.

======
Commit message
1.
... introduces a GUC allowing users set inactive timeout.

~

1a.
You should give the name of the new GUC in the commit message.

1b.
/set/to set/

======
doc/src/sgml/config.sgml

GUC "replication_slot_inactive_timeout"

2.
Invalidates replication slots that are inactive for longer than
specified amount of time

nit - suggest using similar wording as the prior GUC (wal_sender_timeout):

Invalidate replication slots that are inactive for longer than this
amount of time.

~

3.
This invalidation check happens either when the slot is acquired for
use or during a checkpoint. The time since the slot has become
inactive is known from its inactive_since value using which the
timeout is measured.

nit - the wording is too complicated. suggest:

The timeout check occurs when the slot is next acquired for use, or
during a checkpoint. The slot's 'inactive_since' field value is when
the slot became inactive.

~

4.
Note that the inactive timeout invalidation mechanism is not
applicable for slots on the standby that are being synced from a
primary server (whose synced field is true).

nit - that word "whose" seems ambiguous. suggest:

(e.g. the standby slot has 'synced' field true).

======
doc/src/sgml/system-views.sgml

5.
inactive_timeout means that the slot has been inactive for the
duration specified by replication_slot_inactive_timeout parameter.

nit - suggestion ("longer than"):

... the slot has been inactive for longer than the duration specified
by the replication_slot_inactive_timeout parameter.

======
src/backend/replication/slot.c

6.
 /* Maximum number of invalidation causes */
-#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL
+#define RS_INVAL_MAX_CAUSES RS_INVAL_INACTIVE_TIMEOUT

IMO this #define belongs in slot.h, immediately below where the enum
is defined.
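
For example, after the move slot.h might look something like below.
(This is just a sketch -- I am writing the pre-existing enum values
from memory, so please double-check them against the real slot.h):

/* XXX sketch only; existing values/comments recalled from memory */
typedef enum ReplicationSlotInvalidationCause
{
	RS_INVAL_NONE,
	/* required WAL has been removed */
	RS_INVAL_WAL_REMOVED,
	/* required rows have been removed */
	RS_INVAL_HORIZON,
	/* wal_level insufficient for slot */
	RS_INVAL_WAL_LEVEL,
	/* inactive slot timeout has occurred */
	RS_INVAL_INACTIVE_TIMEOUT,
} ReplicationSlotInvalidationCause;

/* Maximum number of invalidation causes */
#define RS_INVAL_MAX_CAUSES RS_INVAL_INACTIVE_TIMEOUT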

~~~

7. ReplicationSlotAcquire:

I had a fundamental question about this logic.

IIUC the purpose of the patch was to invalidate replication slots
that have been inactive for too long.

So, it makes sense to me that some periodic processing (e.g.
CheckPointReplicationSlots) might do a sweep over all the slots, and
invalidate the too-long-inactive ones that it finds.

OTOH, it seemed quite strange to me that the patch logic is also
detecting and invalidating inactive slots during the
ReplicationSlotAcquire function. This is kind of saying "ERROR -
sorry, because this was inactive for too long you can't have it" at
the very moment that you wanted to use it again! IIUC such a slot
would be invalidated by the function InvalidatePossiblyObsoleteSlot(),
but therein lies my doubt -- how can the slot be considered as
"obsolete" when we are in the very act of trying to acquire/use it?

I guess it might be argued this is not so different to the scenario of
attempting to acquire a slot that had been invalidated momentarily
before during checkpoint processing. But, somehow that scenario seems
more like bad luck to me, versus ReplicationSlotAcquire() deliberately
invalidating something we *know* is wanted.

~

8.
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("can no longer get changes from replication slot \"%s\"",
+							NameStr(s->data.name)),
+					 errdetail("This slot has been invalidated because it was inactive since %s for more than %d seconds specified by \"replication_slot_inactive_timeout\".",
+							   timestamptz_to_str(s->inactive_since),
+							   replication_slot_inactive_timeout)));

nit - IMO the info should be split into errdetail + errhint. Like this:

errdetail("The slot became invalid because it was inactive since %s,
which is more than %d seconds ago."...)

errhint("You might need to increase \"%s\".",
"replication_slot_inactive_timeout")

~~~

9. ReportSlotInvalidation

+			appendStringInfo(&err_detail,
+							 _("The slot has been inactive since %s for more than %d seconds specified by \"replication_slot_inactive_timeout\"."),
+							 timestamptz_to_str(inactive_since),
+							 replication_slot_inactive_timeout);
+			break;

IMO this error in ReportSlotInvalidation() should be the same as the
other one from ReplicationSlotAcquire(), which I suggested above
(comment #8) should include a hint. Also, including a hint here will
make this new message consistent with the other errhint (for
"max_slot_wal_keep_size") that is already in this function.

~~~

10. InvalidatePossiblyObsoleteSlot

+		if (cause == RS_INVAL_INACTIVE_TIMEOUT &&
+			(replication_slot_inactive_timeout > 0 &&
+			 s->inactive_since > 0 &&
+			 !(RecoveryInProgress() && s->data.synced)))

10a.
Everything here is && so this has some redundant parentheses.
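
e.g. AFAICT the same condition can be written flat, like below
(sketch only, untested):

		/* equivalent, since every term is && */
		if (cause == RS_INVAL_INACTIVE_TIMEOUT &&
			replication_slot_inactive_timeout > 0 &&
			s->inactive_since > 0 &&
			!(RecoveryInProgress() && s->data.synced))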

10b.
Actually, IMO this complicated condition is overkill. Won't it be
better to just unconditionally assign now = GetCurrentTimestamp();
here?

~

11.
+					 * Note that we don't invalidate synced slots because,
+					 * they are typically considered not active as they don't
+					 * perform logical decoding to produce the changes.

nit - tweaked punctuation

~

12.
+				 * If the slot can be acquired, do so or if the slot is already ours,
+				 * then mark it invalidated. Otherwise we'll signal the owning
+				 * process, below, and retry.

nit - tidied this comment. Suggestion:

If the slot can be acquired, do so and mark it as invalidated. If the
slot is already ours, mark it as invalidated. Otherwise, we'll signal
the owning process below and retry.

~

13.
+		if (active_pid == 0 ||
+			(MyReplicationSlot != NULL &&
+			 MyReplicationSlot == s &&
+			 active_pid == MyProcPid))

You are already checking MyReplicationSlot == s here, so that extra
check for MyReplicationSlot != NULL is redundant, isn't it?
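
i.e. AFAICT it could be reduced to (sketch only, untested):

		/* MyReplicationSlot == s already implies it is not NULL */
		if (active_pid == 0 ||
			(MyReplicationSlot == s &&
			 active_pid == MyProcPid))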

~~~

14. CheckPointReplicationSlots

 /*
- * Flush all replication slots to disk.
+ * Flush all replication slots to disk. Also, invalidate slots during
+ * non-shutdown checkpoint.
  *
  * It is convenient to flush dirty replication slots at the time of checkpoint.
  * Additionally, in case of a shutdown checkpoint, we also identify the slots

nit - /Also, invalidate slots/Also, invalidate obsolete slots/

======
src/backend/utils/misc/guc_tables.c

15.
+		{"replication_slot_inactive_timeout", PGC_SIGHUP, REPLICATION_SENDING,
+			gettext_noop("Sets the amount of time to wait before invalidating an "
+						 "inactive replication slot."),

nit - that is maybe a bit misleading because IIUC there is no real
"waiting" happening anywhere. Suggest:

Sets the amount of time a replication slot can remain inactive before
it will be invalidated.

======

Please take a look at the attached top-up patches. These include
changes for many of the nits above.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 7009350..c96ae53 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -671,9 +671,10 @@ retry:
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("can no longer get changes from replication slot \"%s\"",
 							NameStr(s->data.name)),
-					 errdetail("This slot has been invalidated because it was inactive since %s for more than %d seconds specified by \"replication_slot_inactive_timeout\".",
+					 errdetail("The slot became invalid because it was inactive since %s, which is more than %d seconds ago.",
 							   timestamptz_to_str(s->inactive_since),
-							   replication_slot_inactive_timeout)));
+							   replication_slot_inactive_timeout),
+					 errhint("You might need to increase \"%s\".", "replication_slot_inactive_timeout")));
 	}
 }
@@ -1738,9 +1739,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 					 * is disabled or slot is currently being used or the slot
 					 * on standby is currently being synced from the primary.
 					 *
-					 * Note that we don't invalidate synced slots because,
-					 * they are typically considered not active as they don't
-					 * perform logical decoding to produce the changes.
+					 * Note that we don't invalidate synced slots because
+					 * they are typically considered not active, as they don't
+					 * perform logical decoding to produce changes.
 					 */
 					if (replication_slot_inactive_timeout == 0 ||
 						s->inactive_since == 0 ||
@@ -1789,9 +1790,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			active_pid = s->active_pid;
 
 			/*
-			 * If the slot can be acquired, do so or if the slot is already ours,
-			 * then mark it invalidated. Otherwise we'll signal the owning
-			 * process, below, and retry.
+			 * If the slot can be acquired, do so and mark it as invalidated.
+			 * If the slot is already ours, mark it as invalidated.
+			 * Otherwise, we'll signal the owning process below and retry.
 			 */
 			if (active_pid == 0 ||
 				(MyReplicationSlot != NULL &&
@@ -1975,7 +1976,7 @@ restart:
 }
 
 /*
- * Flush all replication slots to disk. Also, invalidate slots during
+ * Flush all replication slots to disk. Also, invalidate obsolete slots during
  * non-shutdown checkpoint.
  *
  * It is convenient to flush dirty replication slots at the time of checkpoint.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 861692c..73a5824 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3030,8 +3030,9 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"replication_slot_inactive_timeout", PGC_SIGHUP, REPLICATION_SENDING,
-			gettext_noop("Sets the amount of time to wait before invalidating an "
-						 "inactive replication slot."),
+
+			gettext_noop("Sets the amount of time a replication slot can remain "
+						 "inactive before it will be invalidated."),
 			NULL,
 			GUC_UNIT_S
 		},
 		&replication_slot_inactive_timeout,
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbbacbe..bd3ce5a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4568,8 +4568,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
       </term>
       <listitem>
        <para>
-        Invalidates replication slots that are inactive for longer than
-        specified amount of time. If this value is specified without units,
+        Invalidate replication slots that are inactive for longer than this
+        amount of time. If this value is specified without units,
         it is taken as seconds. A value of zero (which is default) disables
         the timeout mechanism. This parameter can only be set in the
         <filename>postgresql.conf</filename> file or on the server
@@ -4577,18 +4577,16 @@
        </para>
 
        <para>
-        This invalidation check happens either when the slot is acquired
-        for use or during a checkpoint. The time since the slot has become
-        inactive is known from its
-        <structfield>inactive_since</structfield> value using which the
-        timeout is measured.
+        The timeout check occurs when the slot is next acquired for use, or
+        during a checkpoint. The slot's <structfield>inactive_since</structfield>
+        field value is when the slot became inactive.
        </para>
 
        <para>
         Note that the inactive timeout invalidation mechanism is not
         applicable for slots on the standby that are being synced from a
-        primary server (whose <structfield>synced</structfield> field is
-        <literal>true</literal>).
+        primary server (e.g. the standby slot <structfield>synced</structfield>
+        field is true).
        </para>
      </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 9e00f7d..f230e6e 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2621,7 +2621,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
        <listitem>
         <para>
          <literal>inactive_timeout</literal> means that the slot has been
-         inactive for the duration specified by
+         inactive for longer than the duration specified by
          <xref linkend="guc-replication-slot-inactive-timeout"/> parameter.
         </para>
        </listitem>