Hi all,

Commit 29d0a77fa6 improved pg_upgrade to allow migrating logical slots. Currently, to check whether the slots are ready to be migrated, we call binary_upgrade_logical_slot_has_caught_up() for every single slot. This checks whether there are any unconsumed WAL records.

However, we noticed a performance issue. If there are many slots (e.g., 100 or more) or if there is a WAL backlog, checking all slots one by one takes a long time. Here are some test results from my environment:

  With an empty cluster: 1.55s
  With 200 slots and 30MB backlog: 15.51s

Commit 6d3d2e8e5 introduced parallel checks per database, but a single job might still have to check too many slots, causing delays.

Since binary_upgrade_logical_slot_has_caught_up() essentially checks whether any decodable record exists in the database, IIUC it is not necessary to check every slot. We can optimize this by checking only the slot with the minimum confirmed_flush_lsn. Every other slot in the same database has a confirmed_flush_lsn at or after that point, so the WAL range it would scan is a subset of the range already checked. If the slot with the minimum is caught up, we can assume the others are too.

The attached patch implements this optimization. With the patch, the test with 200 slots finished in 2.512s. The execution time is now stable regardless of the number of slots.

One thing to note is that DecodeTXNNeedSkip() also considers replication origin filters. Theoretically, a plugin could filter out specific origins, which might lead to different results for different slots. However, this is a very rare case. Even if it happens, it would just result in a false positive (the upgrade fails safely), so the impact is minimal. Therefore, the patch simplifies the check to be per-database instead of per-slot.

Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
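
P.S. For anyone who prefers to see the idea in code, here is a rough toy model of the per-database selection. It is not the patch itself: Slot, slots_to_check(), databases_caught_up() and the has_caught_up callback are hypothetical names made up for this illustration, and LSNs are modeled as plain integers. The callback stands in for the expensive WAL scan that binary_upgrade_logical_slot_has_caught_up() performs.

```python
from typing import Callable, Dict, Iterable, NamedTuple


class Slot(NamedTuple):
    """Toy stand-in for a logical replication slot (hypothetical type)."""
    name: str
    database: str
    confirmed_flush_lsn: int  # LSN simplified to an integer


def slots_to_check(slots: Iterable[Slot]) -> Dict[str, Slot]:
    """Pick, for each database, the slot with the smallest confirmed_flush_lsn."""
    oldest: Dict[str, Slot] = {}
    for slot in slots:
        current = oldest.get(slot.database)
        if current is None or slot.confirmed_flush_lsn < current.confirmed_flush_lsn:
            oldest[slot.database] = slot
    return oldest


def databases_caught_up(
    slots: Iterable[Slot],
    has_caught_up: Callable[[Slot], bool],
) -> Dict[str, bool]:
    """Run the expensive check once per database instead of once per slot.

    has_caught_up is an assumed callback that models the WAL scan; the
    result for the oldest slot is taken as the answer for the whole database.
    """
    return {
        database: has_caught_up(slot)
        for database, slot in slots_to_check(slots).items()
    }
```

With 200 slots spread over a few databases, this model calls has_caught_up only once per database, which matches the observation that the run time no longer depends on the number of slots. It deliberately ignores the replication origin filter case mentioned above.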
v1-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch
