Re: Introduce XID age and inactive timeout based replication slot invalidation

Dilip Kumar Tue, 06 Feb 2024 00:46:51 -0800

On Thu, Jan 11, 2024 at 10:48 AM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> Hi,
>
> Replication slots in postgres will prevent removal of required
> resources when there is no connection using them (inactive). This
> consumes storage because neither required WAL nor required rows from
> the user tables/system catalogs can be removed by VACUUM as long as
> they are required by a replication slot. In extreme cases this could
> cause the transaction ID wraparound.
>
> Currently postgres has the ability to invalidate inactive replication
> slots based on the amount of WAL (set via max_slot_wal_keep_size GUC)
> that will be needed for the slots in case they become active. However,
> the wraparound issue isn't effectively covered by
> max_slot_wal_keep_size - one can't tell postgres to invalidate a
> replication slot if it is blocking VACUUM. Also, it is often tricky to
> choose a default value for max_slot_wal_keep_size, because the amount
> of WAL that gets generated and allocated storage for the database can
> vary.
>
> Therefore, it is often easy for developers to do the following:
> a) set an XID age (age of slot's xmin or catalog_xmin) of say 1 or 1.5
> billion, after which the slots get invalidated.
> b) set a timeout of say 1 or 2 or 3 days, after which the inactive
> slots get invalidated.
>
> To implement (a), postgres needs a new GUC called max_slot_xid_age.
> The checkpointer then invalidates all the slots whose xmin (the oldest
> transaction that this slot needs the database to retain) or
> catalog_xmin (the oldest transaction affecting the system catalogs
> that this slot needs the database to retain) has reached the age
> specified by this setting.
>
> To implement (b), first postgres needs to track the replication slot
> metrics like the time at which the slot became inactive (inactive_at
> timestamptz) and the total number of times the slot became inactive in
> its lifetime (inactive_count numeric) in ReplicationSlotPersistentData
> structure. And, then it needs a new timeout GUC called
> inactive_replication_slot_timeout. Whenever a slot becomes inactive,
> the current timestamp and inactive count are stored in
> ReplicationSlotPersistentData structure and persisted to disk. The
> checkpointer then invalidates all the slots that are lying inactive
> for about inactive_replication_slot_timeout duration starting from
> inactive_at.
>
> In addition to implementing (b), these two new metrics enable
> developers to improve their monitoring tools as the metrics are
> exposed via pg_replication_slots system view. For instance, one can
> build a monitoring tool that signals when replication slots are lying
> inactive for a day or so using inactive_at metric, and/or when a
> replication slot is becoming inactive too frequently using inactive_at
> metric.
>
> I’m attaching the v1 patch set as described below:
> 0001 - Tracks invalidation_reason in pg_replication_slots. This is
> needed because slots now have multiple reasons for slot invalidation.
> 0002 - Tracks inactive replication slot information inactive_at and
> inactive_timeout.
> 0003 - Adds inactive_timeout based replication slot invalidation.
> 0004 - Adds XID based replication slot invalidation.
>
> Thoughts?
>
+1 for the idea,  here are some comments on 0002, I will review other
patches soon and respond.


1.
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>inactive_at</structfield> <type>timestamptz</type>
+      </para>
+      <para>
+        The time at which the slot became inactive.
+        <literal>NULL</literal> if the slot is currently actively being
+        used.
+      </para></entry>
+     </row>

Maybe we can change the field name to 'last_inactive_at'? or maybe the
comment can explain timestampt at which slot was last inactivated.
I think since we are already maintaining the inactive_count so better
to explicitly say this is the last invaliding time.

2.
+ /*
+ * XXX: Can inactive_count of type uint64 ever overflow? It takes
+ * about a half-billion years for inactive_count to overflow even
+ * if slot becomes inactive for every 1 millisecond. So, using
+ * pg_add_u64_overflow might be an overkill.
+ */

Correct we don't need to use pg_add_u64_overflow for this counter.

3.

+
+ /* Convert to numeric. */
+ snprintf(buf, sizeof buf, UINT64_FORMAT, slot_contents.data.inactive_count);
+ values[i++] = DirectFunctionCall3(numeric_in,
+   CStringGetDatum(buf),
+   ObjectIdGetDatum(0),
+   Int32GetDatum(-1));

What is the purpose of doing this? I mean inactive_count is 8 byte
integer and you can define function outparameter as 'int8' which is 8
byte integer.  Then you don't need to convert int to string and then
to numeric?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Introduce XID age and inactive timeout based replication slot invalidation

Reply via email to