Hi, here are some review comments for patch v43-0001.

======
Commit message
1.
... introduces a GUC allowing users set inactive timeout.

~

1a.
You should give the name of the new GUC in the commit message.

1b.
/set/to set/

======
doc/src/sgml/config.sgml

GUC "replication_slot_inactive_timeout"

2.
Invalidates replication slots that are inactive for longer than
specified amount of time

nit - suggest using similar wording as the prior GUC (wal_sender_timeout):

Invalidate replication slots that are inactive for longer than this
amount of time.

~

3.
This invalidation check happens either when the slot is acquired for
use or during a checkpoint. The time since the slot has become
inactive is known from its inactive_since value using which the
timeout is measured.

nit - the wording is too complicated. suggest:

The timeout check occurs when the slot is next acquired for use, or
during a checkpoint. The slot's 'inactive_since' field value is when
the slot became inactive.

~

4.
Note that the inactive timeout invalidation mechanism is not
applicable for slots on the standby that are being synced from a
primary server (whose synced field is true).

nit - that word "whose" seems ambiguous. suggest:

(e.g. the standby slot has 'synced' field true).

======
doc/src/sgml/system-views.sgml

5.
inactive_timeout means that the slot has been inactive for the
duration specified by replication_slot_inactive_timeout parameter.

nit - suggestion ("longer than"):

... the slot has been inactive for longer than the duration specified
by the replication_slot_inactive_timeout parameter.

======
src/backend/replication/slot.c

6.
 /* Maximum number of invalidation causes */
-#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL
+#define RS_INVAL_MAX_CAUSES RS_INVAL_INACTIVE_TIMEOUT

IMO this #define belongs in slot.h, immediately below where the enum
is defined.
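
For example, after the move slot.h might look something like below.
(This is just a sketch -- I am writing the pre-existing enum values
from memory, so please double-check them against the real slot.h):

/* XXX sketch only; existing values/comments recalled from memory */
typedef enum ReplicationSlotInvalidationCause
{
	RS_INVAL_NONE,
	/* required WAL has been removed */
	RS_INVAL_WAL_REMOVED,
	/* required rows have been removed */
	RS_INVAL_HORIZON,
	/* wal_level insufficient for slot */
	RS_INVAL_WAL_LEVEL,
	/* inactive slot timeout has occurred */
	RS_INVAL_INACTIVE_TIMEOUT,
} ReplicationSlotInvalidationCause;

/* Maximum number of invalidation causes */
#define RS_INVAL_MAX_CAUSES RS_INVAL_INACTIVE_TIMEOUT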

~~~

7. ReplicationSlotAcquire:

I had a fundamental question about this logic.

IIUC the purpose of the patch was to invalidate replication slots
that have been inactive for too long.

So, it makes sense to me that some periodic processing (e.g.
CheckPointReplicationSlots) might do a sweep over all the slots, and
invalidate the too-long-inactive ones that it finds.

OTOH, it seemed quite strange to me that the patch logic is also
detecting and invalidating inactive slots during the
ReplicationSlotAcquire function. This is kind of saying "ERROR -
sorry, because this was inactive for too long you can't have it" at
the very moment that you wanted to use it again! IIUC such a slot
would be invalidated by the function InvalidatePossiblyObsoleteSlot(),
but therein lies my doubt -- how can the slot be considered as
"obsolete" when we are in the very act of trying to acquire/use it?

I guess it might be argued this is not so different to the scenario of
attempting to acquire a slot that had been invalidated momentarily
before during checkpoint processing. But, somehow that scenario seems
more like bad luck to me, versus ReplicationSlotAcquire() deliberately
invalidating something we *know* is wanted.

~

8.
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("can no longer get changes from replication slot \"%s\"",
+							NameStr(s->data.name)),
+					 errdetail("This slot has been invalidated because it was inactive since %s for more than %d seconds specified by \"replication_slot_inactive_timeout\".",
+							   timestamptz_to_str(s->inactive_since),
+							   replication_slot_inactive_timeout)));

nit - IMO the info should be split into errdetail + errhint. Like this:

errdetail("The slot became invalid because it was inactive since %s,
which is more than %d seconds ago."...)

errhint("You might need to increase \"%s\".",
"replication_slot_inactive_timeout")

~~~

9. ReportSlotInvalidation

+			appendStringInfo(&err_detail,
+							 _("The slot has been inactive since %s for more than %d seconds specified by \"replication_slot_inactive_timeout\"."),
+							 timestamptz_to_str(inactive_since),
+							 replication_slot_inactive_timeout);
+			break;

IMO this error in ReportSlotInvalidation() should be the same as the
other one from ReplicationSlotAcquire(), which I suggested above
(comment #8) should include a hint. Also, including a hint here will
make this new message consistent with the other errhint (for
"max_slot_wal_keep_size") that is already in this function.

~~~

10. InvalidatePossiblyObsoleteSlot

+		if (cause == RS_INVAL_INACTIVE_TIMEOUT &&
+			(replication_slot_inactive_timeout > 0 &&
+			 s->inactive_since > 0 &&
+			 !(RecoveryInProgress() && s->data.synced)))

10a.
Everything here is && so this has some redundant parentheses.
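
e.g. AFAICT the same condition can be written flat, like below
(sketch only, untested):

		/* equivalent, since every term is && */
		if (cause == RS_INVAL_INACTIVE_TIMEOUT &&
			replication_slot_inactive_timeout > 0 &&
			s->inactive_since > 0 &&
			!(RecoveryInProgress() && s->data.synced))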

10b.
Actually, IMO this complicated condition is overkill. Won't it be
better to just unconditionally assign now = GetCurrentTimestamp();
here?

~

11.
+					 * Note that we don't invalidate synced slots because,
+					 * they are typically considered not active as they don't
+					 * perform logical decoding to produce the changes.

nit - tweaked punctuation

~

12.
+				 * If the slot can be acquired, do so or if the slot is already ours,
+				 * then mark it invalidated. Otherwise we'll signal the owning
+				 * process, below, and retry.

nit - tidied this comment. Suggestion:

If the slot can be acquired, do so and mark it as invalidated. If the
slot is already ours, mark it as invalidated. Otherwise, we'll signal
the owning process below and retry.

~

13.
+		if (active_pid == 0 ||
+			(MyReplicationSlot != NULL &&
+			 MyReplicationSlot == s &&
+			 active_pid == MyProcPid))

You are already checking MyReplicationSlot == s here, so that extra
check for MyReplicationSlot != NULL is redundant, isn't it?
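
i.e. AFAICT it could be reduced to (sketch only, untested):

		/* MyReplicationSlot == s already implies it is not NULL */
		if (active_pid == 0 ||
			(MyReplicationSlot == s &&
			 active_pid == MyProcPid))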

~~~

14. CheckPointReplicationSlots

 /*
- * Flush all replication slots to disk.
+ * Flush all replication slots to disk. Also, invalidate slots during
+ * non-shutdown checkpoint.
  *
  * It is convenient to flush dirty replication slots at the time of checkpoint.
  * Additionally, in case of a shutdown checkpoint, we also identify the slots

nit - /Also, invalidate slots/Also, invalidate obsolete slots/

======
src/backend/utils/misc/guc_tables.c

15.
+		{"replication_slot_inactive_timeout", PGC_SIGHUP, REPLICATION_SENDING,
+			gettext_noop("Sets the amount of time to wait before invalidating an "
+						 "inactive replication slot."),

nit - that is maybe a bit misleading because IIUC there is no real
"waiting" happening anywhere. Suggest:

Sets the amount of time a replication slot can remain inactive before
it will be invalidated.

======

Please take a look at the attached top-up patches. These include
changes for many of the nits above.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 7009350..c96ae53 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -671,9 +671,10 @@ retry:
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("can no longer get changes from replication slot \"%s\"",
 							NameStr(s->data.name)),
-					 errdetail("This slot has been invalidated because it was inactive since %s for more than %d seconds specified by \"replication_slot_inactive_timeout\".",
+					 errdetail("The slot became invalid because it was inactive since %s, which is more than %d seconds ago.",
 							   timestamptz_to_str(s->inactive_since),
-							   replication_slot_inactive_timeout)));
+							   replication_slot_inactive_timeout),
+					 errhint("You might need to increase \"%s\".", "replication_slot_inactive_timeout")));
 	}
 }
@@ -1738,9 +1739,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 					 * is disabled or slot is currently being used or the slot
 					 * on standby is currently being synced from the primary.
 					 *
-					 * Note that we don't invalidate synced slots because,
-					 * they are typically considered not active as they don't
-					 * perform logical decoding to produce the changes.
+					 * Note that we don't invalidate synced slots because
+					 * they are typically considered not active, as they don't
+					 * perform logical decoding to produce changes.
 					 */
 					if (replication_slot_inactive_timeout == 0 ||
 						s->inactive_since == 0 ||
@@ -1789,9 +1790,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			active_pid = s->active_pid;
 
 			/*
-			 * If the slot can be acquired, do so or if the slot is already ours,
-			 * then mark it invalidated. Otherwise we'll signal the owning
-			 * process, below, and retry.
+			 * If the slot can be acquired, do so and mark it as invalidated.
+			 * If the slot is already ours, mark it as invalidated.
+			 * Otherwise, we'll signal the owning process below and retry.
 			 */
 			if (active_pid == 0 ||
 				(MyReplicationSlot != NULL &&
@@ -1975,7 +1976,7 @@ restart:
 }
 
 /*
- * Flush all replication slots to disk. Also, invalidate slots during
+ * Flush all replication slots to disk. Also, invalidate obsolete slots during
  * non-shutdown checkpoint.
  *
  * It is convenient to flush dirty replication slots at the time of checkpoint.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 861692c..73a5824 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3030,8 +3030,9 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"replication_slot_inactive_timeout", PGC_SIGHUP, REPLICATION_SENDING,
-			gettext_noop("Sets the amount of time to wait before invalidating an "
-						 "inactive replication slot."),
+
+			gettext_noop("Sets the amount of time a replication slot can remain "
+						 "inactive before it will be invalidated."),
 			NULL,
 			GUC_UNIT_S
 		},
 		&replication_slot_inactive_timeout,
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbbacbe..bd3ce5a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4568,8 +4568,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
       </term>
       <listitem>
        <para>
-        Invalidates replication slots that are inactive for longer than
-        specified amount of time. If this value is specified without units,
+        Invalidate replication slots that are inactive for longer than this
+        amount of time. If this value is specified without units,
         it is taken as seconds. A value of zero (which is default) disables
         the timeout mechanism. This parameter can only be set in the
         <filename>postgresql.conf</filename> file or on the server
@@ -4577,18 +4577,16 @@
        </para>
 
        <para>
-        This invalidation check happens either when the slot is acquired
-        for use or during a checkpoint. The time since the slot has become
-        inactive is known from its
-        <structfield>inactive_since</structfield> value using which the
-        timeout is measured.
+        The timeout check occurs when the slot is next acquired for use, or
+        during a checkpoint. The slot's <structfield>inactive_since</structfield>
+        field value is when the slot became inactive.
        </para>
 
        <para>
         Note that the inactive timeout invalidation mechanism is not
         applicable for slots on the standby that are being synced from a
-        primary server (whose <structfield>synced</structfield> field is
-        <literal>true</literal>).
+        primary server (e.g. the standby slot <structfield>synced</structfield>
+        field is true).
        </para>
      </listitem>
    </varlistentry>
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 9e00f7d..f230e6e 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2621,7 +2621,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
        <listitem>
         <para>
          <literal>inactive_timeout</literal> means that the slot has been
-         inactive for the duration specified by
+         inactive for longer than the duration specified by
          <xref linkend="guc-replication-slot-inactive-timeout"/> parameter.
         </para>
        </listitem>