Hi,

Replication slots in postgres prevent the removal of required resources even when there is no connection using them (i.e., when they are inactive). This consumes storage because neither required WAL nor required rows from the user tables/system catalogs can be removed by VACUUM as long as they are required by a replication slot. In extreme cases this could cause transaction ID wraparound.
Currently postgres has the ability to invalidate inactive replication slots based on the amount of WAL (set via the max_slot_wal_keep_size GUC) that would be needed by the slots in case they become active again. However, the wraparound issue isn't effectively covered by max_slot_wal_keep_size - one can't tell postgres to invalidate a replication slot if it is blocking VACUUM. Also, it is often tricky to choose a suitable value for max_slot_wal_keep_size, because the amount of WAL generated and the storage allocated for a database can vary.

Therefore, it is often easier for developers to instead do the following: a) set an XID age (age of a slot's xmin or catalog_xmin) of, say, 1 or 1.5 billion, after which the slot gets invalidated; b) set a timeout of, say, 1, 2 or 3 days, after which inactive slots get invalidated.

To implement (a), postgres needs a new GUC called max_slot_xid_age. The checkpointer then invalidates all slots whose xmin (the oldest transaction that the slot needs the database to retain) or catalog_xmin (the oldest transaction affecting the system catalogs that the slot needs the database to retain) has reached the age specified by this setting; see the sketch below.

To implement (b), postgres first needs to track, in the ReplicationSlotPersistentData structure, the time at which a slot became inactive (inactive_at timestamptz) and the total number of times the slot became inactive in its lifetime (inactive_count numeric). It then needs a new timeout GUC called inactive_replication_slot_timeout. Whenever a slot becomes inactive, the current timestamp and the incremented inactive count are stored in the ReplicationSlotPersistentData structure and persisted to disk. The checkpointer then invalidates all slots that have been lying inactive for inactive_replication_slot_timeout or longer, measured from inactive_at; a sketch of this check follows as well.
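To make (a) a bit more concrete, here's a minimal sketch of the check the checkpointer could run for each slot before invalidating it. SlotXidAgeExceeded is a made-up helper name for illustration; the GUC variable and the actual invalidation plumbing would follow the existing max_slot_wal_keep_size machinery:

    #include "postgres.h"
    #include "access/transam.h"
    #include "replication/slot.h"

    /* proposed GUC; 0 disables XID-age-based invalidation */
    int max_slot_xid_age = 0;

    /*
     * Return true if the slot's xmin or catalog_xmin has grown older
     * than max_slot_xid_age.  XID age is the number of XIDs consumed
     * since the given XID, computed with modulo-2^32 arithmetic so it
     * behaves across wraparound.
     */
    static bool
    SlotXidAgeExceeded(ReplicationSlot *slot)
    {
        TransactionId next_xid = ReadNextTransactionId();

        if (max_slot_xid_age <= 0)
            return false;

        if (TransactionIdIsNormal(slot->data.xmin) &&
            (int32) (next_xid - slot->data.xmin) > max_slot_xid_age)
            return true;

        if (TransactionIdIsNormal(slot->data.catalog_xmin) &&
            (int32) (next_xid - slot->data.catalog_xmin) > max_slot_xid_age)
            return true;

        return false;
    }

When this returns true, the slot would be marked invalidated the same way WAL-size-based invalidation works today.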
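Similarly, a rough sketch of the (b) side, with the new persistent fields shown in a comment. SlotInactiveTimeoutExceeded is again a made-up name, the GUC is assumed here to be in seconds, and treating inactive_at == 0 as "never became inactive" is an assumption of this sketch:

    #include "postgres.h"
    #include "replication/slot.h"
    #include "utils/timestamp.h"

    /*
     * New fields appended to ReplicationSlotPersistentData, updated
     * every time the slot is released without an active connection,
     * and persisted to disk:
     *
     *     TimestampTz inactive_at;     when the slot last became inactive
     *     uint64      inactive_count;  times the slot became inactive
     */

    /* proposed GUC, assumed to be in seconds; 0 disables the check */
    int inactive_replication_slot_timeout = 0;

    /*
     * Return true if an inactive slot has stayed inactive longer than
     * inactive_replication_slot_timeout.
     */
    static bool
    SlotInactiveTimeoutExceeded(ReplicationSlot *slot)
    {
        if (inactive_replication_slot_timeout <= 0)
            return false;

        /*
         * Skip slots that are in use, and slots that never went
         * inactive (assumes inactive_at == 0 means "never set").
         */
        if (slot->active_pid != 0 || slot->data.inactive_at == 0)
            return false;

        return TimestampDifferenceExceeds(slot->data.inactive_at,
                                          GetCurrentTimestamp(),
                                          inactive_replication_slot_timeout * 1000);
    }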
In addition to implementing (b), these two new metrics enable developers to improve their monitoring tools, as the metrics are exposed via the pg_replication_slots system view. For instance, one can build a monitoring tool that raises an alert when a replication slot has been lying inactive for a day or so (using the inactive_at metric), and/or when a replication slot is becoming inactive too frequently (using the inactive_count metric).

I'm attaching the v1 patch set as described below:

0001 - Tracks invalidation_reason in pg_replication_slots. This is needed because slots can now be invalidated for multiple reasons.
0002 - Tracks inactive replication slot information inactive_at and inactive_count.
0003 - Adds inactive_timeout based replication slot invalidation.
0004 - Adds XID based replication slot invalidation.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:
v1-0001-Track-invalidation_reason-in-pg_replication_slots.patch
v1-0002-Track-inactive-replication-slot-information.patch
v1-0003-Add-inactive_timeout-based-replication-slot-inval.patch
v1-0004-Add-XID-based-replication-slot-invalidation.patch