Hi,

Replication slots in postgres prevent the removal of required resources even when there is no connection using them (i.e., when they are inactive). This consumes storage because neither required WAL nor required rows from the user tables/system catalogs can be removed by VACUUM as long as they are required by a replication slot. In extreme cases this could cause transaction ID wraparound.
Currently postgres has the ability to invalidate inactive replication slots based on the amount of WAL (set via the max_slot_wal_keep_size GUC) that would be needed by the slots in case they become active again. However, the wraparound issue isn't effectively covered by max_slot_wal_keep_size - one can't tell postgres to invalidate a replication slot if it is blocking VACUUM. Also, it is often tricky to choose a suitable value for max_slot_wal_keep_size, because the amount of WAL generated and the storage allocated for a database can vary.

Therefore, it is often easier for developers to instead do the following: a) set an XID age (age of a slot's xmin or catalog_xmin) of, say, 1 or 1.5 billion, after which the slot gets invalidated; b) set a timeout of, say, 1, 2 or 3 days, after which inactive slots get invalidated.

To implement (a), postgres needs a new GUC called max_slot_xid_age. The checkpointer then invalidates all slots whose xmin (the oldest transaction that the slot needs the database to retain) or catalog_xmin (the oldest transaction affecting the system catalogs that the slot needs the database to retain) has reached the age specified by this setting; see the sketch below.

To implement (b), postgres first needs to track, in the ReplicationSlotPersistentData structure, the time at which a slot became inactive (inactive_at timestamptz) and the total number of times the slot became inactive in its lifetime (inactive_count numeric). It then needs a new timeout GUC called inactive_replication_slot_timeout. Whenever a slot becomes inactive, the current timestamp and the incremented inactive count are stored in the ReplicationSlotPersistentData structure and persisted to disk. The checkpointer then invalidates all slots that have been lying inactive for inactive_replication_slot_timeout or longer, measured from inactive_at; a sketch of this check follows as well.
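To make (a) a bit more concrete, here's a minimal sketch of the check the checkpointer could run for each slot before invalidating it. SlotXidAgeExceeded is a made-up helper name for illustration; the GUC variable and the actual invalidation plumbing would follow the existing max_slot_wal_keep_size machinery:

    #include "postgres.h"
    #include "access/transam.h"
    #include "replication/slot.h"

    /* proposed GUC; 0 disables XID-age-based invalidation */
    int max_slot_xid_age = 0;

    /*
     * Return true if the slot's xmin or catalog_xmin has grown older
     * than max_slot_xid_age.  XID age is the number of XIDs consumed
     * since the given XID, computed with modulo-2^32 arithmetic so it
     * behaves across wraparound.
     */
    static bool
    SlotXidAgeExceeded(ReplicationSlot *slot)
    {
        TransactionId next_xid = ReadNextTransactionId();

        if (max_slot_xid_age <= 0)
            return false;

        if (TransactionIdIsNormal(slot->data.xmin) &&
            (int32) (next_xid - slot->data.xmin) > max_slot_xid_age)
            return true;

        if (TransactionIdIsNormal(slot->data.catalog_xmin) &&
            (int32) (next_xid - slot->data.catalog_xmin) > max_slot_xid_age)
            return true;

        return false;
    }

When this returns true, the slot would be marked invalidated the same way WAL-size-based invalidation works today.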
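Similarly, a rough sketch of the (b) side, with the new persistent fields shown in a comment. SlotInactiveTimeoutExceeded is again a made-up name, the GUC is assumed here to be in seconds, and treating inactive_at == 0 as "never became inactive" is an assumption of this sketch:

    #include "postgres.h"
    #include "replication/slot.h"
    #include "utils/timestamp.h"

    /*
     * New fields appended to ReplicationSlotPersistentData, updated
     * every time the slot is released without an active connection,
     * and persisted to disk:
     *
     *     TimestampTz inactive_at;     when the slot last became inactive
     *     uint64      inactive_count;  times the slot became inactive
     */

    /* proposed GUC, assumed to be in seconds; 0 disables the check */
    int inactive_replication_slot_timeout = 0;

    /*
     * Return true if an inactive slot has stayed inactive longer than
     * inactive_replication_slot_timeout.
     */
    static bool
    SlotInactiveTimeoutExceeded(ReplicationSlot *slot)
    {
        if (inactive_replication_slot_timeout <= 0)
            return false;

        /*
         * Skip slots that are in use, and slots that never went
         * inactive (assumes inactive_at == 0 means "never set").
         */
        if (slot->active_pid != 0 || slot->data.inactive_at == 0)
            return false;

        return TimestampDifferenceExceeds(slot->data.inactive_at,
                                          GetCurrentTimestamp(),
                                          inactive_replication_slot_timeout * 1000);
    }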
In addition to implementing (b), these two new metrics enable developers to improve their monitoring tools, as the metrics are exposed via the pg_replication_slots system view. For instance, one can build a monitoring tool that raises an alert when a replication slot has been lying inactive for a day or so (using the inactive_at metric), and/or when a replication slot is becoming inactive too frequently (using the inactive_count metric).

I'm attaching the v1 patch set as described below:

0001 - Tracks invalidation_reason in pg_replication_slots. This is needed because slots can now be invalidated for multiple reasons.
0002 - Tracks inactive replication slot information inactive_at and inactive_count.
0003 - Adds inactive_timeout based replication slot invalidation.
0004 - Adds XID based replication slot invalidation.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:
v1-0001-Track-invalidation_reason-in-pg_replication_slots.patch
v1-0002-Track-inactive-replication-slot-information.patch
v1-0003-Add-inactive_timeout-based-replication-slot-inval.patch
v1-0004-Add-XID-based-replication-slot-invalidation.patch