On Mon, Feb 3, 2025 at 12:04 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Mon, Feb 3, 2025 at 6:16 AM Peter Smith <smithpb2...@gmail.com> wrote: > > > > > > 2. > > + appendStringInfo(&err_hint, "You might need to increase \"%s\".", > > + "max_slot_wal_keep_size"); > > break; > > 2a. > > In this case, shouldn't you really be using macro _("You might need to > > increase \"%s\".") so that the common format string would be got using > > gettext()? > > > > ~ > > > > > > ~~~ > > > > 3. > > + appendStringInfo(&err_hint, "You might need to increase \"%s\".", > > + "idle_replication_slot_timeout"); > > + break; > > > > 3a. > > Ditto above. IMO this common format string should be got using macro. > > e.g.: _("You might need to increase \"%s\".") > > > > ~ > > Instead, we can directly use '_(' in errhint as we are doing in one > other similar place "errhint("%s", _(view_updatable_error))));". I > think we didn't use it for errdetail because, in one of the cases, it > needs to use ngettext >
Please find the v67 patches: - The patches have been rebased after separating the inactive_since related changes. - Patches 001 and 002 incorporate the above comments and the comments from [1] and [2]. - No change in patch-003 since the last version. [1] https://www.postgresql.org/message-id/CALDaNm0FS%2BFqQk2dadiJFCMM_MhKROMsJUb%3Db8wtRH6isScQsQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPs_6%2BNBOt%2BKpQQaBG2R3T-FLS93TbUC27uzyDMu%3D37n-Q%40mail.gmail.com -- Thanks, Nisha
From 26c4c85a709221d59e9911b693c910f489672ade Mon Sep 17 00:00:00 2001 From: Nisha Moond <nisha.moond412@gmail.com> Date: Mon, 3 Feb 2025 15:20:40 +0530 Subject: [PATCH v67 1/3] Introduce inactive_timeout based replication slot invalidation Till now, postgres has the ability to invalidate inactive replication slots based on the amount of WAL (set via max_slot_wal_keep_size GUC) that will be needed for the slots in case they become active. However, choosing a default value for this GUC is a bit tricky. Because the amount of WAL a database generates, and the allocated storage per instance will vary greatly in production, making it difficult to pin down a one-size-fits-all value. It is often easy for users to set a timeout of say 1 or 2 or n days, after which all the inactive slots get invalidated. This commit introduces a GUC named idle_replication_slot_timeout. When set, postgres invalidates slots (during non-shutdown checkpoints) that are idle for longer than this amount of time. Note that the idle timeout invalidation mechanism is not applicable for slots that do not reserve WAL or for slots on the standby server that are being synced from the primary server (i.e., standby slots having 'synced' field 'true'). Synced slots are always considered to be inactive because they don't perform logical decoding to produce changes. 
--- doc/src/sgml/config.sgml | 40 ++++ doc/src/sgml/logical-replication.sgml | 5 + doc/src/sgml/system-views.sgml | 7 + src/backend/replication/slot.c | 171 ++++++++++++++++-- src/backend/utils/adt/timestamp.c | 18 ++ src/backend/utils/misc/guc_tables.c | 14 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/bin/pg_basebackup/pg_createsubscriber.c | 4 + src/bin/pg_upgrade/server.c | 7 + src/include/replication/slot.h | 3 + src/include/utils/guc_hooks.h | 2 + src/include/utils/timestamp.h | 3 + 12 files changed, 260 insertions(+), 15 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a782f10998..a065fbbaab 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4423,6 +4423,46 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows </listitem> </varlistentry> + <varlistentry id="guc-idle-replication-slot-timeout" xreflabel="idle_replication_slot_timeout"> + <term><varname>idle_replication_slot_timeout</varname> (<type>integer</type>) + <indexterm> + <primary><varname>idle_replication_slot_timeout</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Invalidate replication slots that have remained idle longer than this + duration. If this value is specified without units, it is taken as + minutes. A value of zero disables the idle timeout invalidation + mechanism. The default is one day. This parameter can only be set in + the <filename>postgresql.conf</filename> file or on the server command + line. + </para> + + <para> + Slot invalidation due to idle timeout occurs during checkpoint. + Because checkpoints happen at <varname>checkpoint_timeout</varname> + intervals, there can be some lag between when the + <varname>idle_replication_slot_timeout</varname> was exceeded and when + the slot invalidation is triggered at the next checkpoint. + To avoid such lags, users can force a checkpoint to promptly invalidate + inactive slots. 
The duration of slot inactivity is calculated using the + slot's <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield> + value. + </para> + + <para> + Note that the idle timeout invalidation mechanism is not applicable + for slots that do not reserve WAL or for slots on the standby server + that are being synced from the primary server (i.e., standby slots + having <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>synced</structfield> + value <literal>true</literal>). + Synced slots are always considered to be inactive because they don't + perform logical decoding to produce changes. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-wal-sender-timeout" xreflabel="wal_sender_timeout"> <term><varname>wal_sender_timeout</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml index 613abcd28b..3d18e507bb 100644 --- a/doc/src/sgml/logical-replication.sgml +++ b/doc/src/sgml/logical-replication.sgml @@ -2390,6 +2390,11 @@ CONTEXT: processing remote data for replication origin "pg_16395" during "INSER plus some reserve for table synchronization. </para> + <para> + Logical replication slots are also affected by + <link linkend="guc-idle-replication-slot-timeout"><varname>idle_replication_slot_timeout</varname></link>. + </para> + <para> <link linkend="guc-max-wal-senders"><varname>max_wal_senders</varname></link> should be set to at least the same as diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 8e2b0a7927..88003abee9 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2620,6 +2620,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx perform logical decoding. It is set only for logical slots. 
</para> </listitem> + <listitem> + <para> + <literal>idle_timeout</literal> means that the slot has remained + idle longer than the configured + <xref linkend="guc-idle-replication-slot-timeout"/> duration. + </para> + </listitem> </itemizedlist> </para></entry> </row> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index c57a13d820..4d1e1bc800 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -107,10 +107,11 @@ const char *const SlotInvalidationCauses[] = { [RS_INVAL_WAL_REMOVED] = "wal_removed", [RS_INVAL_HORIZON] = "rows_removed", [RS_INVAL_WAL_LEVEL] = "wal_level_insufficient", + [RS_INVAL_IDLE_TIMEOUT] = "idle_timeout", }; /* Maximum number of invalidation causes */ -#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL +#define RS_INVAL_MAX_CAUSES RS_INVAL_IDLE_TIMEOUT StaticAssertDecl(lengthof(SlotInvalidationCauses) == (RS_INVAL_MAX_CAUSES + 1), "array length mismatch"); @@ -141,6 +142,12 @@ ReplicationSlot *MyReplicationSlot = NULL; int max_replication_slots = 10; /* the maximum number of replication * slots */ +/* + * Invalidate replication slots that have remained idle longer than this + * duration; '0' disables it. + */ +int idle_replication_slot_timeout_mins = HOURS_PER_DAY * MINS_PER_HOUR; + /* * This GUC lists streaming replication standby server slot names that * logical WAL sender processes will wait for. 
@@ -1518,12 +1525,14 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause, NameData slotname, XLogRecPtr restart_lsn, XLogRecPtr oldestLSN, - TransactionId snapshotConflictHorizon) + TransactionId snapshotConflictHorizon, + TimestampTz inactive_since) { StringInfoData err_detail; - bool hint = false; + StringInfoData err_hint; initStringInfo(&err_detail); + initStringInfo(&err_hint); switch (cause) { @@ -1531,13 +1540,15 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause, { unsigned long long ex = oldestLSN - restart_lsn; - hint = true; appendStringInfo(&err_detail, ngettext("The slot's restart_lsn %X/%X exceeds the limit by %llu byte.", "The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.", ex), LSN_FORMAT_ARGS(restart_lsn), ex); + /* translator: %s is a GUC variable name */ + appendStringInfo(&err_hint, "You might need to increase \"%s\".", + "max_slot_wal_keep_size"); break; } case RS_INVAL_HORIZON: @@ -1548,6 +1559,19 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause, case RS_INVAL_WAL_LEVEL: appendStringInfoString(&err_detail, _("Logical decoding on standby requires \"wal_level\" >= \"logical\" on the primary server.")); break; + + case RS_INVAL_IDLE_TIMEOUT: + Assert(inactive_since > 0); + + /* translator: second %s is a GUC variable name */ + appendStringInfo(&err_detail, _("The slot's idle time %s exceeds the configured \"%s\" duration."), + timestamptz_to_str(inactive_since), + "idle_replication_slot_timeout"); + /* translator: %s is a GUC variable name */ + appendStringInfo(&err_hint, "You might need to increase \"%s\".", + "idle_replication_slot_timeout"); + break; + case RS_INVAL_NONE: pg_unreachable(); } @@ -1559,9 +1583,36 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause, errmsg("invalidating obsolete replication slot \"%s\"", NameStr(slotname)), errdetail_internal("%s", err_detail.data), - hint ? 
errhint("You might need to increase \"%s\".", "max_slot_wal_keep_size") : 0); + err_hint.len ? errhint("%s", _(err_hint.data)) : 0); pfree(err_detail.data); + pfree(err_hint.data); +} + +/* + * Can we invalidate an idle replication slot? + * + * Idle timeout invalidation is allowed only when: + * + * 1. Idle timeout is set + * 2. Slot has WAL reserved + * 3. Slot is inactive + * 4. The slot is not being synced from the primary while the server + * is in recovery + * + * Note that the idle timeout invalidation mechanism is not + * applicable for slots on the standby server that are being synced + * from the primary server (i.e., standby slots having 'synced' field 'true'). + * Synced slots are always considered to be inactive because they don't + * perform logical decoding to produce changes. + */ +static inline bool +CanInvalidateIdleSlot(ReplicationSlot *s) +{ + return (idle_replication_slot_timeout_mins > 0 && + !XLogRecPtrIsInvalid(s->data.restart_lsn) && + s->inactive_since > 0 && + !(RecoveryInProgress() && s->data.synced)); } /* @@ -1591,6 +1642,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, TransactionId initial_catalog_effective_xmin = InvalidTransactionId; XLogRecPtr initial_restart_lsn = InvalidXLogRecPtr; ReplicationSlotInvalidationCause invalidation_cause_prev PG_USED_FOR_ASSERTS_ONLY = RS_INVAL_NONE; + TimestampTz inactive_since = 0; for (;;) { @@ -1598,6 +1650,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, NameData slotname; int active_pid = 0; ReplicationSlotInvalidationCause invalidation_cause = RS_INVAL_NONE; + TimestampTz now = 0; Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED)); @@ -1608,6 +1661,15 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, break; } + if (cause == RS_INVAL_IDLE_TIMEOUT) + { + /* + * We get the current time beforehand to avoid system call while + * holding the spinlock. 
+ */ + now = GetCurrentTimestamp(); + } + /* * Check if the slot needs to be invalidated. If it needs to be * invalidated, and is not currently acquired, acquire it and mark it @@ -1661,6 +1723,21 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, if (SlotIsLogical(s)) invalidation_cause = cause; break; + case RS_INVAL_IDLE_TIMEOUT: + Assert(now > 0); + + /* + * Check if the slot needs to be invalidated due to + * idle_replication_slot_timeout GUC. + */ + if (CanInvalidateIdleSlot(s) && + TimestampDifferenceExceedsSeconds(s->inactive_since, now, + idle_replication_slot_timeout_mins * SECS_PER_MINUTE)) + { + invalidation_cause = cause; + inactive_since = s->inactive_since; + } + break; case RS_INVAL_NONE: pg_unreachable(); } @@ -1711,9 +1788,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, /* * The logical replication slots shouldn't be invalidated as GUC - * max_slot_wal_keep_size is set to -1 during the binary upgrade. See - * check_old_cluster_for_valid_slots() where we ensure that no - * invalidated before the upgrade. + * max_slot_wal_keep_size is set to -1 and + * idle_replication_slot_timeout is set to 0 during the binary + * upgrade. See check_old_cluster_for_valid_slots() where we ensure + * that no invalidated before the upgrade. 
*/ Assert(!(*invalidated && SlotIsLogical(s) && IsBinaryUpgrade)); @@ -1745,7 +1823,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, { ReportSlotInvalidation(invalidation_cause, true, active_pid, slotname, restart_lsn, - oldestLSN, snapshotConflictHorizon); + oldestLSN, snapshotConflictHorizon, + inactive_since); if (MyBackendType == B_STARTUP) (void) SendProcSignal(active_pid, @@ -1791,7 +1870,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, ReportSlotInvalidation(invalidation_cause, false, active_pid, slotname, restart_lsn, - oldestLSN, snapshotConflictHorizon); + oldestLSN, snapshotConflictHorizon, + inactive_since); /* done with this slot for now */ break; @@ -1806,14 +1886,16 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, /* * Invalidate slots that require resources about to be removed. * - * Returns true when any slot have got invalidated. + * Returns true if there are any invalidated slots. * - * Whether a slot needs to be invalidated depends on the cause. A slot is - * removed if it: + * Whether a slot needs to be invalidated depends on the invalidation cause. + * A slot is invalidated if it: * - RS_INVAL_WAL_REMOVED: requires a LSN older than the given segment * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given * db; dboid may be InvalidOid for shared relations - * - RS_INVAL_WAL_LEVEL: is logical + * - RS_INVAL_WAL_LEVEL: is logical and wal_level is insufficient + * - RS_INVAL_IDLE_TIMEOUT: has been idle longer than the configured + * "idle_replication_slot_timeout" duration. * * NB - this runs as part of checkpoint, so avoid raising errors if possible. */ @@ -1866,7 +1948,8 @@ restart: } /* - * Flush all replication slots to disk. + * Flush all replication slots to disk. Also, invalidate obsolete slots during + * non-shutdown checkpoint. * * It is convenient to flush dirty replication slots at the time of checkpoint. 
* Additionally, in case of a shutdown checkpoint, we also identify the slots @@ -1924,6 +2007,45 @@ CheckPointReplicationSlots(bool is_shutdown) SaveSlotToPath(s, path, LOG); } LWLockRelease(ReplicationSlotAllocationLock); + + if (!is_shutdown) + { + elog(DEBUG1, "performing replication slot invalidation checks"); + + /* + * NB: We will make another pass over replication slots for + * invalidation checks to keep the code simple. Testing shows that + * there is no noticeable overhead (when compared with wal_removed + * invalidation) even if we were to do idle_timeout invalidation of + * thousands of replication slots here. If it is ever proven that this + * assumption is wrong, we will have to perform the invalidation + * checks in the above for loop with the following changes: + * + * - Acquire ControlLock lock once before the loop. + * + * - Call InvalidatePossiblyObsoleteSlot for each slot. + * + * - Handle the cases in which ControlLock gets released just like + * InvalidateObsoleteReplicationSlots does. + * + * - Avoid saving slot info to disk two times for each invalidated + * slot. + * + * XXX: Should we move idle_timeout invalidation check closer to + * wal_removed in CreateCheckPoint and CreateRestartPoint? + * + * XXX: Slot invalidation due to 'idle_timeout' applies only to + * released slots, and is based on the 'idle_replication_slot_timeout' + * GUC. Active slots currently in use for replication are excluded to + * prevent accidental invalidation. Slots where communication between + * the publisher and subscriber is down are also excluded, as they are + * managed by the 'wal_sender_timeout'. 
+ */ + InvalidateObsoleteReplicationSlots(RS_INVAL_IDLE_TIMEOUT, + 0, + InvalidOid, + InvalidTransactionId); + } } /* @@ -2803,3 +2925,22 @@ WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) ConditionVariableCancelSleep(); } + +/* + * GUC check_hook for idle_replication_slot_timeout + * + * The value of idle_replication_slot_timeout must be set to 0 during + * a binary upgrade. See start_postmaster() in pg_upgrade for more details. + */ +bool +check_idle_replication_slot_timeout(int *newval, void **extra, GucSource source) +{ + if (IsBinaryUpgrade && *newval != 0) + { + GUC_check_errdetail("The value of \"%s\" must be set to 0 during binary upgrade mode.", + "idle_replication_slot_timeout"); + return false; + } + + return true; +} diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index ba9bae0506..9682f9dbdc 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -1786,6 +1786,24 @@ TimestampDifferenceExceeds(TimestampTz start_time, return (diff >= msec * INT64CONST(1000)); } +/* + * Check if the difference between two timestamps is >= a given + * threshold (expressed in seconds). + */ +bool +TimestampDifferenceExceedsSeconds(TimestampTz start_time, + TimestampTz stop_time, + int threshold_sec) +{ + long secs; + int usecs; + + /* Calculate the difference in seconds */ + TimestampDifference(start_time, stop_time, &secs, &usecs); + + return (secs >= threshold_sec); +} + /* * Convert a time_t to TimestampTz. 
* diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 71448bb4fd..8b107c6529 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -3048,6 +3048,20 @@ struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"idle_replication_slot_timeout", PGC_SIGHUP, REPLICATION_SENDING, + gettext_noop("Sets the duration a replication slot can remain idle before " + "it is invalidated."), + NULL, + GUC_UNIT_MIN + }, + &idle_replication_slot_timeout_mins, + HOURS_PER_DAY * MINS_PER_HOUR, /* 1 day */ + 0, + INT_MAX / SECS_PER_MINUTE, + check_idle_replication_slot_timeout, NULL, NULL + }, + { {"commit_delay", PGC_SUSET, WAL_SETTINGS, gettext_noop("Sets the delay in microseconds between transaction commit and " diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 079efa1baa..70be3a2ce5 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -326,6 +326,7 @@ # (change requires restart) #wal_keep_size = 0 # in megabytes; 0 disables #max_slot_wal_keep_size = -1 # in megabytes; -1 disables +#idle_replication_slot_timeout = 1d # in minutes; 0 disables #wal_sender_timeout = 60s # in milliseconds; 0 disables #track_commit_timestamp = off # collect timestamp of transaction commit # (change requires restart) diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c index faf18ccf13..f2ef8d5ccc 100644 --- a/src/bin/pg_basebackup/pg_createsubscriber.c +++ b/src/bin/pg_basebackup/pg_createsubscriber.c @@ -1438,6 +1438,10 @@ start_standby_server(const struct CreateSubscriberOptions *opt, bool restricted_ appendPQExpBuffer(pg_ctl_cmd, "\"%s\" start -D ", pg_ctl_path); appendShellString(pg_ctl_cmd, subscriber_dir); appendPQExpBuffer(pg_ctl_cmd, " -s -o \"-c sync_replication_slots=off\""); + + /* Prevent unintended slot invalidation */ + 
appendPQExpBuffer(pg_ctl_cmd, " -o \"-c idle_replication_slot_timeout=0\""); + if (restricted_access) { appendPQExpBuffer(pg_ctl_cmd, " -o \"-p %s\"", opt->sub_port); diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c index de6971cde6..873e5b5117 100644 --- a/src/bin/pg_upgrade/server.c +++ b/src/bin/pg_upgrade/server.c @@ -252,6 +252,13 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error) if (GET_MAJOR_VERSION(cluster->major_version) >= 1700) appendPQExpBufferStr(&pgoptions, " -c max_slot_wal_keep_size=-1"); + /* + * Use idle_replication_slot_timeout=0 to prevent slot invalidation due to + * idle_timeout by checkpointer process during upgrade. + */ + if (GET_MAJOR_VERSION(cluster->major_version) >= 1800) + appendPQExpBufferStr(&pgoptions, " -c idle_replication_slot_timeout=0"); + /* * Use -b to disable autovacuum and logical replication launcher * (effective in PG17 or later for the latter). diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 47ebdaecb6..5c6458485c 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -56,6 +56,8 @@ typedef enum ReplicationSlotInvalidationCause RS_INVAL_HORIZON, /* wal_level insufficient for slot */ RS_INVAL_WAL_LEVEL, + /* idle slot timeout has occurred */ + RS_INVAL_IDLE_TIMEOUT, } ReplicationSlotInvalidationCause; extern PGDLLIMPORT const char *const SlotInvalidationCauses[]; @@ -237,6 +239,7 @@ extern PGDLLIMPORT ReplicationSlot *MyReplicationSlot; /* GUCs */ extern PGDLLIMPORT int max_replication_slots; extern PGDLLIMPORT char *synchronized_standby_slots; +extern PGDLLIMPORT int idle_replication_slot_timeout_mins; /* shmem initialization functions */ extern Size ReplicationSlotsShmemSize(void); diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h index 87999218d6..951451a976 100644 --- a/src/include/utils/guc_hooks.h +++ b/src/include/utils/guc_hooks.h @@ -174,5 +174,7 @@ extern void 
assign_wal_sync_method(int new_wal_sync_method, void *extra); extern bool check_synchronized_standby_slots(char **newval, void **extra, GucSource source); extern void assign_synchronized_standby_slots(const char *newval, void *extra); +extern bool check_idle_replication_slot_timeout(int *newval, void **extra, + GucSource source); #endif /* GUC_HOOKS_H */ diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h index d26f023fb8..e1d05d6779 100644 --- a/src/include/utils/timestamp.h +++ b/src/include/utils/timestamp.h @@ -143,5 +143,8 @@ extern int date2isoyear(int year, int mon, int mday); extern int date2isoyearday(int year, int mon, int mday); extern bool TimestampTimestampTzRequiresRewrite(void); +extern bool TimestampDifferenceExceedsSeconds(TimestampTz start_time, + TimestampTz stop_time, + int threshold_sec); #endif /* TIMESTAMP_H */ -- 2.34.1
From bfeac59d262bbb024aa90292e079e203304930f5 Mon Sep 17 00:00:00 2001 From: Nisha Moond <nisha.moond412@gmail.com> Date: Thu, 30 Jan 2025 21:07:12 +0530 Subject: [PATCH v67 2/3] Add TAP test for slot invalidation based on inactive timeout. This test uses injection points to bypass the time overhead caused by the idle_replication_slot_timeout GUC, which has a minimum value of one minute. --- src/backend/replication/slot.c | 34 ++++-- src/test/recovery/meson.build | 1 + .../t/044_invalidate_inactive_slots.pl | 103 ++++++++++++++++++ 3 files changed, 129 insertions(+), 9 deletions(-) create mode 100644 src/test/recovery/t/044_invalidate_inactive_slots.pl diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 4d1e1bc800..09e618339c 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -55,6 +55,7 @@ #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" +#include "utils/injection_point.h" #include "utils/guc_hooks.h" #include "utils/varlena.h" @@ -1726,16 +1727,31 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, case RS_INVAL_IDLE_TIMEOUT: Assert(now > 0); - /* - * Check if the slot needs to be invalidated due to - * idle_replication_slot_timeout GUC. - */ - if (CanInvalidateIdleSlot(s) && - TimestampDifferenceExceedsSeconds(s->inactive_since, now, - idle_replication_slot_timeout_mins * SECS_PER_MINUTE)) + if (CanInvalidateIdleSlot(s)) { - invalidation_cause = cause; - inactive_since = s->inactive_since; +#ifdef USE_INJECTION_POINTS + + /* + * To test idle timeout slot invalidation, if the + * slot-time-out-inval injection point is attached, + * set inactive_since to a very old timestamp (1 + * microsecond since epoch) to immediately invalidate + * the slot. + */ + if (IS_INJECTION_POINT_ATTACHED("slot-time-out-inval")) + s->inactive_since = 1; +#endif + + /* + * Check if the slot needs to be invalidated due to + * idle_replication_slot_timeout GUC. 
+ */ + if (TimestampDifferenceExceedsSeconds(s->inactive_since, now, + idle_replication_slot_timeout_mins * SECS_PER_MINUTE)) + { + invalidation_cause = cause; + inactive_since = s->inactive_since; + } } break; case RS_INVAL_NONE: diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build index 0428704dbf..057bcde143 100644 --- a/src/test/recovery/meson.build +++ b/src/test/recovery/meson.build @@ -52,6 +52,7 @@ tests += { 't/041_checkpoint_at_promote.pl', 't/042_low_level_backup.pl', 't/043_no_contrecord_switch.pl', + 't/044_invalidate_inactive_slots.pl', ], }, } diff --git a/src/test/recovery/t/044_invalidate_inactive_slots.pl b/src/test/recovery/t/044_invalidate_inactive_slots.pl new file mode 100644 index 0000000000..262e603eb5 --- /dev/null +++ b/src/test/recovery/t/044_invalidate_inactive_slots.pl @@ -0,0 +1,103 @@ +# Copyright (c) 2025, PostgreSQL Global Development Group + +# Test for replication slots invalidation +use strict; +use warnings FATAL => 'all'; + +use PostgreSQL::Test::Utils; +use PostgreSQL::Test::Cluster; +use Test::More; + +if ($ENV{enable_injection_points} ne 'yes') +{ + plan skip_all => 'Injection points not supported by this build'; +} + +# Wait for slot to first become idle and then get invalidated +sub wait_for_slot_invalidation +{ + my ($node, $slot, $offset) = @_; + my $node_name = $node->name; + + # The slot's invalidation should be logged + $node->wait_for_log(qr/invalidating obsolete replication slot \"$slot\"/, + $offset); + + # Check that the invalidation reason is 'idle_timeout' + $node->poll_query_until( + 'postgres', qq[ + SELECT COUNT(slot_name) = 1 FROM pg_replication_slots + WHERE slot_name = '$slot' AND + invalidation_reason = 'idle_timeout'; + ]) + or die + "Timed out while waiting for invalidation reason of slot $slot to be set on node $node_name"; + + # Check that an invalidated slot cannot be acquired + my ($result, $stdout, $stderr); + ($result, $stdout, $stderr) = $node->psql( + 'postgres', qq[ + 
SELECT pg_replication_slot_advance('$slot', '0/1'); + ]); + ok( $stderr =~ /can no longer access replication slot "$slot"/, + "detected error upon trying to acquire invalidated slot $slot on node $node_name" + ) + or die + "could not detect error upon trying to acquire invalidated slot $slot on node $node_name"; +} + +# ======================================================================== +# Testcase start +# +# Test invalidation of streaming standby slot and logical slot due to idle +# timeout. + +# Initialize primary +my $node = PostgreSQL::Test::Cluster->new('primary'); +$node->init(allows_streaming => 'logical'); + +# Avoid unpredictability +$node->append_conf( + 'postgresql.conf', qq{ +checkpoint_timeout = 1h +idle_replication_slot_timeout = 1min +}); +$node->start; + +# Create both streaming standby and logical slot +$node->safe_psql( + 'postgres', qq[ + SELECT pg_create_physical_replication_slot(slot_name := 'physical_slot', immediately_reserve := true); +]); +$node->psql('postgres', + q{SELECT pg_create_logical_replication_slot('logical_slot', 'test_decoding');} +); + +my $logstart = -s $node->logfile; + +# Register an injection point on the primary to forcibly cause a slot timeout +$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;'); + +# Check if the extension injection_points is available, as it may be +# possible that this script is run with installcheck, where the module +# would not be installed by default. +if (!$node->check_extension('injection_points')) +{ + plan skip_all => 'Extension injection_points not installed'; +} + +$node->safe_psql('postgres', + "SELECT injection_points_attach('slot-time-out-inval', 'error');"); + +# Run a checkpoint which will invalidate the slots +$node->safe_psql('postgres', "CHECKPOINT"); + +# Wait for slots to become inactive. Note that nobody has acquired the slot +# yet, so it must get invalidated due to idle timeout. 
+wait_for_slot_invalidation($node, 'physical_slot', $logstart); +wait_for_slot_invalidation($node, 'logical_slot', $logstart); + +# Testcase end +# ============================================================================= + +done_testing(); -- 2.34.1
From 48c9e2e788d83ccb9b830080ea75fc15d3bb034c Mon Sep 17 00:00:00 2001 From: Nisha Moond <nisha.moond412@gmail.com> Date: Wed, 29 Jan 2025 12:12:00 +0530 Subject: [PATCH v67 3/3] Add TAP test for slot invalidation based on inactive timeout. This patch adds the same test, but places it under PG_TEST_EXTRA instead of using injection points. Since the minimum value for GUC 'idle_replication_slot_timeout' is one minute, the test takes more than a minute to complete and is disabled by default. Use PG_TEST_EXTRA=idle_replication_slot_timeout with "make" to run the test. --- .cirrus.tasks.yml | 2 +- doc/src/sgml/regress.sgml | 10 + src/test/recovery/README | 5 + src/test/recovery/meson.build | 1 + .../045_invalidate_inactive_slots_pg_extra.pl | 208 ++++++++++++++++++ 5 files changed, 225 insertions(+), 1 deletion(-) create mode 100644 src/test/recovery/t/045_invalidate_inactive_slots_pg_extra.pl diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml index 18e944ca89..8d3c13fcee 100644 --- a/.cirrus.tasks.yml +++ b/.cirrus.tasks.yml @@ -20,7 +20,7 @@ env: MTEST_ARGS: --print-errorlogs --no-rebuild -C build PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf - PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance + PG_TEST_EXTRA: kerberos ldap ssl libpq_encryption load_balance idle_replication_slot_timeout # What files to preserve in case tests fail diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml index 7c474559bd..f5b1f2f353 100644 --- a/doc/src/sgml/regress.sgml +++ b/doc/src/sgml/regress.sgml @@ -347,6 +347,16 @@ make check-world PG_TEST_EXTRA='kerberos ldap ssl load_balance libpq_encryption' </para> </listitem> </varlistentry> + + <varlistentry> + <term><literal>idle_replication_slot_timeout</literal></term> + <listitem> + <para> + Runs the test <filename>src/test/recovery/t/045_invalidate_inactive_slots_pg_extra.pl</filename>. 
+ Not enabled by default because it is time consuming. + </para> + </listitem> + </varlistentry> </variablelist> Tests for features that are not supported by the current build diff --git a/src/test/recovery/README b/src/test/recovery/README index 896df0ad05..5c066fc41f 100644 --- a/src/test/recovery/README +++ b/src/test/recovery/README @@ -30,4 +30,9 @@ PG_TEST_EXTRA=wal_consistency_checking to the "make" command. This is resource-intensive, so it's not done by default. +If you want to test idle_replication_slot_timeout, add +PG_TEST_EXTRA=idle_replication_slot_timeout +to the "make" command. This test takes over a minutes, so it's not done +by default. + See src/test/perl/README for more info about running these tests. diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build index 057bcde143..0a037b4b65 100644 --- a/src/test/recovery/meson.build +++ b/src/test/recovery/meson.build @@ -53,6 +53,7 @@ tests += { 't/042_low_level_backup.pl', 't/043_no_contrecord_switch.pl', 't/044_invalidate_inactive_slots.pl', + 't/045_invalidate_inactive_slots_pg_extra.pl', ], }, } diff --git a/src/test/recovery/t/045_invalidate_inactive_slots_pg_extra.pl b/src/test/recovery/t/045_invalidate_inactive_slots_pg_extra.pl new file mode 100644 index 0000000000..577f69d05d --- /dev/null +++ b/src/test/recovery/t/045_invalidate_inactive_slots_pg_extra.pl @@ -0,0 +1,208 @@ +# Copyright (c) 2025, PostgreSQL Global Development Group + +# Test for replication slots invalidation +use strict; +use warnings FATAL => 'all'; + +use PostgreSQL::Test::Utils; +use PostgreSQL::Test::Cluster; +use Test::More; + +# The test takes over two minutes to complete. Run it only if +# idle_replication_slot_timeout is specified in PG_TEST_EXTRA. 
+if ( !$ENV{PG_TEST_EXTRA}
+    || $ENV{PG_TEST_EXTRA} !~ /\bidle_replication_slot_timeout\b/)
+{
+    plan skip_all =>
+      'test idle_replication_slot_timeout not enabled in PG_TEST_EXTRA';
+}
+
+# =============================================================================
+# Testcase start
+#
+# Test invalidation of a streaming standby's slot and a logical failover slot
+# on the primary due to idle timeout. Also test that a logical failover slot
+# synced to the standby is not invalidated on its own, but instead inherits
+# the invalidated state from the primary.
+
+# Initialize primary
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 'logical');
+
+# Avoid unpredictable checkpoint timing
+$primary->append_conf(
+    'postgresql.conf', qq{
+checkpoint_timeout = 1h
+idle_replication_slot_timeout = 1
+});
+$primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$primary->backup($backup_name);
+
+# Create sync slot on the primary
+$primary->psql('postgres',
+    q{SELECT pg_create_logical_replication_slot('sync_slot1', 'test_decoding', false, false, true);}
+);
+
+# Create standby1's slot on the primary
+$primary->safe_psql(
+    'postgres', qq[
+    SELECT pg_create_physical_replication_slot(slot_name := 'sb_slot1', immediately_reserve := true);
+]);
+
+# Create standby2's slot on the primary
+$primary->safe_psql(
+    'postgres', qq[
+    SELECT pg_create_physical_replication_slot(slot_name := 'sb_slot2', immediately_reserve := true);
+]);
+
+# Create standby1
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup($primary, $backup_name, has_streaming => 1);
+
+my $connstr = $primary->connstr;
+$standby1->append_conf(
+    'postgresql.conf', qq(
+hot_standby_feedback = on
+primary_slot_name = 'sb_slot1'
+idle_replication_slot_timeout = 1
+primary_conninfo = '$connstr dbname=postgres'
+));
+$standby1->start;
+
+# Wait until the standby has replayed enough data
+$primary->wait_for_catchup($standby1);
+
+# Create standby2
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup($primary, $backup_name, has_streaming => 1);
+
+$connstr = $primary->connstr;
+$standby2->append_conf(
+    'postgresql.conf', qq(
+hot_standby_feedback = on
+primary_slot_name = 'sb_slot2'
+idle_replication_slot_timeout = 1
+primary_conninfo = '$connstr dbname=postgres'
+));
+$standby2->start;
+
+# Wait until the standby has replayed enough data
+$primary->wait_for_catchup($standby2);
+
+# The timeout GUC is set on the standby as well, to verify that a checkpoint
+# there does not invalidate synced slots on its own.
+
+# Make standby2's slot on the primary inactive
+$standby2->stop;
+
+# Sync the primary slots to the standby
+$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
+
+# Confirm that the logical failover slot is created on the standby
+is( $standby1->safe_psql(
+        'postgres',
+        q{SELECT count(*) = 1 FROM pg_replication_slots
+          WHERE slot_name = 'sync_slot1' AND synced
+          AND NOT temporary
+          AND invalidation_reason IS NULL;}
+    ),
+    't',
+    'logical slot sync_slot1 is synced to standby');
+
+# Give enough time for inactive_since to exceed the timeout
+sleep(61);
+
+# On standby, synced slots are not invalidated by the idle timeout
+# until the invalidation state is propagated from the primary.
+$standby1->safe_psql('postgres', "CHECKPOINT");
+is( $standby1->safe_psql(
+        'postgres',
+        q{SELECT count(*) = 1 FROM pg_replication_slots
+          WHERE slot_name = 'sync_slot1'
+          AND invalidation_reason IS NULL;}
+    ),
+    't',
+    'check that synced slot sync_slot1 has not been invalidated on standby');
+
+my $logstart = -s $primary->logfile;
+
+# Wait for the logical failover slot to be invalidated on the primary. Note
+# that nobody has acquired the slot yet, so it must get invalidated due to
+# idle timeout: 61 seconds have already elapsed, and we wait up to another
+# 10 seconds to make the test reliable.
+wait_for_slot_invalidation($primary, 'sync_slot1', $logstart, 10); + +# Re-sync the primary slots to the standby. Note that the primary slot was +# already invalidated (above) due to idle timeout. The standby must just +# sync the invalidated state. +$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();"); + +is( $standby1->safe_psql( + 'postgres', + q{SELECT count(*) = 1 FROM pg_replication_slots + WHERE slot_name = 'sync_slot1' + AND invalidation_reason = 'idle_timeout';} + ), + "t", + 'check that invalidation of synced slot sync_slot1 is synced on standby'); + +# By now standby2's slot must be invalidated due to idle timeout, +# check for invalidation. +wait_for_slot_invalidation($primary, 'sb_slot2', $logstart, 1); + +# Testcase end +# ============================================================================= + +# Wait for slot to first become idle and then get invalidated +sub wait_for_slot_invalidation +{ + my ($node, $slot, $offset, $wait_time_secs) = @_; + my $node_name = $node->name; + + trigger_slot_invalidation($node, $slot, $offset, $wait_time_secs); + + # Check that an invalidated slot cannot be acquired + my ($result, $stdout, $stderr); + ($result, $stdout, $stderr) = $node->psql( + 'postgres', qq[ + SELECT pg_replication_slot_advance('$slot', '0/1'); + ]); + ok( $stderr =~ /can no longer access replication slot "$slot"/, + "detected error upon trying to acquire invalidated slot $slot on node $node_name" + ) + or die + "could not detect error upon trying to acquire invalidated slot $slot on node $node_name"; +} + +# Trigger slot invalidation and confirm it in the server log +sub trigger_slot_invalidation +{ + my ($node, $slot, $offset, $wait_time_secs) = @_; + my $node_name = $node->name; + my $invalidated = 0; + + # Give enough time for inactive_since to exceed the timeout + sleep($wait_time_secs); + + # Run a checkpoint + $node->safe_psql('postgres', "CHECKPOINT"); + + # The slot's invalidation should be logged + 
$node->wait_for_log(qr/invalidating obsolete replication slot \"$slot\"/, + $offset); + + # Check that the invalidation reason is 'idle_timeout' + $node->poll_query_until( + 'postgres', qq[ + SELECT COUNT(slot_name) = 1 FROM pg_replication_slots + WHERE slot_name = '$slot' AND + invalidation_reason = 'idle_timeout'; + ]) + or die + "Timed out while waiting for invalidation reason of slot $slot to be set on node $node_name"; +} + +done_testing(); -- 2.34.1
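For anyone trying the patch locally, here is a small sketch of how the PG_TEST_EXTRA gating is expected to behave. The grep call mirrors the word-boundary match (`/\bidle_replication_slot_timeout\b/`) in the test's skip check; the make invocation in the comment is only an assumption about a typical built source tree, not part of the patch:

```shell
# Mirror of the test's skip condition: idle_replication_slot_timeout must
# appear as a whole word in PG_TEST_EXTRA, otherwise the test plans skip_all.
PG_TEST_EXTRA="kerberos ldap ssl idle_replication_slot_timeout"

if printf '%s\n' "$PG_TEST_EXTRA" | grep -qw 'idle_replication_slot_timeout'; then
    echo "045 test enabled"
    # Typical invocation (assumes a configured and built tree):
    #   PG_TEST_EXTRA=idle_replication_slot_timeout \
    #       make -C src/test/recovery check \
    #       PROVE_TESTS=t/045_invalidate_inactive_slots_pg_extra.pl
else
    echo "045 test skipped"
fi
```

Note that `grep -w` treats underscores as word characters, so a value like `idle_replication_slot_timeout_extra` would not accidentally enable the test.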