Here are the test steps and analysis for epoch-related handling (Tested with v15 Patch-Set).
In the 'update_deleted' detection design, the launcher process compares XIDs to track the minimum XID, and the apply workers maintain their oldest running XIDs. The launcher also requests the publisher's status at regular intervals, which includes the epoch information. Proper epoch handling is therefore required for this machinery to keep working correctly across XID wraparound.

Functions requiring epoch handling:
1) 'get_candidate_xid()': Tracks the node's oldest running XID and identifies the next candidate XID to advance the oldest non-removable XID of an apply worker.
2) 'wait_for_publisher_status()': Tracks the publisher's oldest and next XIDs for monitoring concurrent remote transactions.

--
To test the epoch handling, I added extra LOG statements in the above functions and in the launcher code. The patches for these changes are attached (they apply atop v15-0005). The tests confirmed that epoch handling works correctly during XID wraparound on both the publisher and subscriber sides. Detailed test steps and results are provided below.

~~~~
Setup:
- Created two nodes, 'Pub' and 'Sub', with logical replication.
- On both nodes, configured 'autovacuum_naptime = 1s' to allow frequent vacuuming while XIDs are consumed rapidly.
- On Sub, created a subscription for a table, subscribed to all changes from Pub.
- Installed and enabled the 'xid_wraparound' extension on both nodes:
  CREATE EXTENSION xid_wraparound;
(A rough sketch of these setup commands is given below for reference.)
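The object names and connection string in this sketch are illustrative placeholders, not necessarily the exact ones used in the test:

-- On both nodes: vacuum frequently while XIDs are being consumed rapidly
ALTER SYSTEM SET autovacuum_naptime = '1s';
SELECT pg_reload_conf();

-- On both nodes: the xid_wraparound test module provides consume_xids()
CREATE EXTENSION xid_wraparound;

-- On Pub: a test table published for all changes (placeholder names)
CREATE TABLE tab1 (a int PRIMARY KEY);
CREATE PUBLICATION pub1 FOR TABLE tab1;

-- On Sub: the same table, subscribed to all changes from Pub
CREATE TABLE tab1 (a int PRIMARY KEY);
CREATE SUBSCRIPTION sub1 CONNECTION 'port=8834 dbname=postgres' PUBLICATION pub1;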
~~~~

-----------------------------------------------------------------
Case-1: When XID Wraparound Happens on Sub
-----------------------------------------------------------------
Scenario: In 'get_candidate_xid()', 'oldest_running_xid' and 'next_full_xid' have different epochs, i.e., a transaction from the old epoch is still running when an XID wraparound happens on the subscriber.

Test Steps:
Perform the steps below on the Sub node:

1. Consume 4 billion XIDs in batches (400M each).
   -- The attached script "consume_4B_xids.sh" was used to consume the XIDs.

2. Set 'detect_update_deleted=ON' for the subscription.

3. Hold a transaction with an old XID from before the wraparound:
   -- Start a new session, begin a transaction, and leave it open. This transaction will have an XID close to 4 billion.

4. In another session, trigger the wraparound by consuming the remaining XIDs (2^32 - 4B):
   SELECT consume_xids('294966530');
   -- At Sub, the newly added logs show that the wraparound happened and that the epoch was handled correctly by choosing the right candidate_full_xid (see also the plain-SQL cross-check sketched after Case-2):
   LOG: XXX: oldest_running_xid = 4000000762
   LOG: XXX: next_full_xid = 766
   LOG: XXX: xid WRAPAROUND happened!!!
   LOG: XXX: candidate full_xid = 4000000762

5. Confirm that the launcher updates the "pg_conflict_detection" slot's xmin with the new epoch:
   - End the open transaction.
   - Verify that the oldest running XID is now updated to the new-epoch XID:
     LOG: XXX: oldest_running_xid = 766
     LOG: XXX: next_full_xid = 766
     LOG: XXX: candidate full_xid = 766
   - Confirm that the launcher updates the new-epoch XID as xmin:
     LOG: XXX: launcher new_xid = 766
     LOG: XXX: launcher current slot xmin = 4000000762
     LOG: XXX: launcher full_xmin = 4000000762
     LOG: XXX: launcher updated xmin = 766

     postgres=# SELECT slot_name, slot_type, active, xmin FROM pg_replication_slots;
            slot_name       | slot_type | active | xmin
     -----------------------+-----------+--------+------
      pg_conflict_detection | physical  | t      |  766
     (1 row)

~~~~
-----------------------------------------------------------------
Case-2: When XID Wraparound Happens on Pub
-----------------------------------------------------------------
Scenario: In 'wait_for_publisher_status()', 'data->last_phase_at' (the oldest committing remote XID) and 'remote_next_xid' have different epochs, i.e., a transaction from the old epoch is in its commit phase on the remote node (Pub) when an XID wraparound happens on the publisher.

Test Steps:

1. Consume 4 billion XIDs in batches (400M each) on the Publisher node.
   -- The attached script "consume_4B_xids.sh" was used to consume the XIDs.

2. At Sub, set 'detect_update_deleted=ON' for the subscription.

3. Confirm that the latest remote XIDs are updated on Sub:
   LOG: XXX: last_phase_at = 4000000796
   LOG: XXX: remote_oldestxid = 4000000796
   LOG: XXX: remote_nextxid = 4000000796
   LOG: XXX: remote_full_xid = 4000000796
   LOG: XXX: remote concurrent txn completed

4. Hold a transaction in the commit phase:
   - On Pub, attach a debugger to a session, start a transaction, and hold it at 'XactLogCommitRecord()'.
   - This step is required because the launcher at Sub tracks remote concurrent transactions that are currently committing.

5. In another session, trigger the wraparound by consuming the remaining XIDs (2^32 - 4B):
   SELECT consume_xids('294966530');
   -- At Sub, the logs confirm that the wraparound happened on the Pub node:
   LOG: XXX: last_phase_at = 4000000797
   LOG: XXX: remote_oldestxid = 4000000796
   LOG: XXX: remote_nextxid = 801
   LOG: XXX: xid WRAPAROUND happened on Publisher!!!
   LOG: XXX: remote_full_xid = 4000000796

6. Release the debugger and confirm that the remote's oldest XID is updated to the new epoch:
   LOG: XXX: last_phase_at = 4000000797
   LOG: XXX: remote_oldestxid = 801
   LOG: XXX: remote_nextxid = 801
   LOG: XXX: remote_full_xid = 801
   LOG: XXX: remote concurrent txn completed
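Note: independent of the patch's added LOG lines, the epoch change can also be cross-checked with plain SQL on the node where the wraparound is triggered, while the old-epoch transaction is still open (Case-1 step 3) or held in its commit phase (Case-2 step 4). A minimal sketch, assuming the held transaction has actually been assigned an XID (e.g., by running a DML statement or pg_current_xact_id() in it):

-- From another session: the held transaction's 32-bit XID, still close to 4 billion
SELECT pid, backend_xid FROM pg_stat_activity WHERE backend_xid IS NOT NULL;

-- Full 64-bit next XID; once the remaining XIDs are consumed this value is past
-- 2^32, i.e. the epoch has advanced while the old-epoch XID is still running
SELECT pg_snapshot_xmax(pg_current_snapshot());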
----------------------------- END -------------------------------

Thanks,
Nisha

consume_4B_xids.sh:

#!/bin/bash
# Consume XIDs in batches by repeatedly calling consume_xids() from the
# xid_wraparound extension.
port=8835

for i in {1..10}; do
    echo "Running SELECT for i = $i"
    ./psql -d postgres -p $port -c "SELECT consume_xids(100000000);"
done
epoch_test_case-1_logs.patch
epoch_test_case-2_logs.patch