pg_upgrade: optimize replication slot caught-up check

Masahiko Sawada Mon, 05 Jan 2026 10:03:31 -0800

Hi all,

Commit 29d0a77fa6 improved pg_upgrade to allow migrating logical
slots. Currently, to check if the slots are ready to be migrated, we
call binary_upgrade_logical_slot_has_caught_up() for every single
slot. This checks whether any unconsumed WAL records remain. However,
we noticed a performance issue: because the slots are checked one by
one, the check takes a long time when there are many slots (e.g., 100
or more) or when there is a WAL backlog.


Here are some test results from my environment:
With an empty cluster: 1.55s
With 200 slots and 30MB backlog: 15.51s

Commit 6d3d2e8e5 introduced parallel checks per database, but a single
job might still have to check too many slots, causing delays.

Since binary_upgrade_logical_slot_has_caught_up() essentially checks
if any decodable record exists in the database, IIUC it is not
necessary to check every slot. We can optimize this by checking only
the slot with the minimum confirmed_flush_lsn. Every other slot in the
same database starts scanning at or after that LSN, so it covers a
subset of the same WAL range; if the minimum slot is caught up, we can
assume the others are too. The attached patch implements this
optimization. With the patch, the test with 200 slots finished in
2.512s. The execution time is now stable regardless of the number of
slots.
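
To illustrate the selection step only, here is a minimal Python sketch,
not the code in the patch. The Slot type and the helper names
parse_lsn() and slot_to_check() are hypothetical; the only thing taken
from PostgreSQL is the textual pg_lsn format (two hexadecimal halves
separated by a slash). It assumes the slots passed in all belong to
the same database.

```python
from typing import List, NamedTuple, Optional


class Slot(NamedTuple):
    """Hypothetical stand-in for one logical replication slot."""
    name: str
    confirmed_flush_lsn: str  # pg_lsn text form, e.g. "0/3000060"


def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string "hi/lo" (two hex halves) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def slot_to_check(slots: List[Slot]) -> Optional[Slot]:
    """Return the slot with the smallest confirmed_flush_lsn, or None if empty.

    Assumption of this sketch: all slots are in the same database, so only
    this one slot needs the expensive caught-up check.
    """
    if not slots:
        return None
    return min(slots, key=lambda s: parse_lsn(s.confirmed_flush_lsn))
```

The LSNs are compared numerically rather than as strings, since a
plain string comparison would order "0/9000000" after "0/10000000".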

One thing to note is that DecodeTXNNeedSkip() also considers
replication origin filters. Theoretically, a plugin could filter out
specific origins, which might lead to different results. However, this
is a very rare case. Even if it happens, it would just result in a
false positive (the check reports pending changes and the upgrade
fails safely), so the impact is minimal.
Therefore, the patch simplifies the check to be per-database instead
of per-slot.
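
The following toy model in Python sketches why the simplification can
only err on the safe side. It is not how PostgreSQL decodes WAL: the
Record type and both functions are hypothetical, and it assumes the
per-database check applies no origin filter at all, while a per-slot
check skips records whose origin the slot's plugin filters out.

```python
from typing import List, NamedTuple, Set


class Record(NamedTuple):
    """Hypothetical decodable change in the database's WAL."""
    lsn: int
    origin_id: int  # 0 models a locally generated change


def per_slot_has_pending(records: List[Record], confirmed_flush: int,
                         filtered_origins: Set[int]) -> bool:
    """Toy per-slot check: ignore records from origins the plugin filters."""
    return any(r.lsn >= confirmed_flush and r.origin_id not in filtered_origins
               for r in records)


def per_database_has_pending(records: List[Record],
                             min_confirmed_flush: int) -> bool:
    """Toy per-database check: any record at or after the minimum LSN counts."""
    return any(r.lsn >= min_confirmed_flush for r in records)
```

In this model, a record from a filtered origin makes the per-database
check report pending changes even though the per-slot check would not,
which is the false positive described above. The reverse cannot
happen: whenever the per-database check reports nothing pending, no
per-slot check in that database would either.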

Feedback is very welcome.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v1-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch
Description: Binary data
