Hi,

On Tue, May 13, 2025 at 3:48 PM shveta malik <shveta.ma...@gmail.com> wrote:
>
> With the given script, the problem reproduces on Head and PG17. We are
> trying to reproduce the issue on PG16 and below where injection points
> are not there.
>

The issue can also be reproduced on PostgreSQL versions 13 through 16.

The same steps shared earlier in the
'reproduce_data_duplicate_without_twophase.sh' script can be used to
reproduce the issue on PG14 to PG16. Since back branches do not support
injection points, you can add infinite loops at the locations where the patch
'v1-0001-Injection-points-to-reproduce-the-confirmed_flush.patch'
introduces injection points. These loops allow processes to be held and
released with a debugger when needed.

Attached are detailed documents describing the reproduction steps:
 1) Use 'reproduce_steps_for_pg14_to_16.txt' for PG14 to PG16.
 2) Use 'reproduce_steps_for_pg13.txt' for PG13.

Note: PG13 uses temporary replication slots for tablesync workers, unlike
later versions, which use permanent slots. Because of this difference, some
debugger-related steps differ slightly in PG13, which is why a separate
document is provided for it.

--
Thanks,
Nisha

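For anyone following the attached steps, here is a rough sketch of one way to find the processes to attach to and to hold them at a named function. It uses plain gdb breakpoints rather than the infinite loops described above, so it only covers hold points that coincide with a function entry (for example ProcessRepliesIfAny or maybe_reread_subscription). The names gdb_hold_cmd, SUB_WORKERS_SQL and WALSENDER_SQL are made up for illustration, and the sketch assumes gdb is installed, the server was built with debug symbols, and the subscription is named 'sub' as in the steps.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the attached documents.

# Run on the subscriber. In pg_stat_subscription, relid is NULL for the apply
# worker and holds the table's OID for a tablesync worker.
SUB_WORKERS_SQL="SELECT pid, relid::regclass FROM pg_stat_subscription WHERE subname = 'sub'"

# Run on the primary to list walsender processes.
WALSENDER_SQL="SELECT pid, application_name FROM pg_stat_replication"

# Print a gdb command line that attaches to process $1, sets a breakpoint at
# location $2 (a function name or file:line), and resumes the process until
# the breakpoint is hit.
gdb_hold_cmd() {
    printf "gdb -p %s -ex 'break %s' -ex 'continue'\n" "$1" "$2"
}

# Example (commented out because it needs running servers):
#   psql -At -d postgres -p "$port_subscriber" -c "$SUB_WORKERS_SQL"
#   psql -At -d postgres -p "$port_primary" -c "$WALSENDER_SQL"
#   gdb_hold_cmd <pid> maybe_reread_subscription
```

The helper only prints the command so that it can be reviewed before being run in a separate terminal. Hold points described as "just before it sets the state to ..." do not map to a function entry and need a file:line location that depends on the branch being tested.
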
Below are the detailed steps to reproduce the issue on PG14 to PG16 using a
debugger:

(Note: Since these steps are intended to be run manually, short delays such as
"sleep 1" between steps are assumed and not explicitly mentioned. Any wait
longer than one second is explicitly called out.)

--------------------

1. Set up the primary and subscriber nodes with the same configuration as in
   reproduce_data_duplicate_without_twophase.sh. (The script can be used to do
   the initial setup.)

2. On Primary: Create table tab1, insert a value, and create a publication.

   psql -d postgres -p $port_primary -c "CREATE TABLE tab1(a int); INSERT INTO tab1 VALUES(1); CREATE PUBLICATION pub FOR TABLE tab1;"

3. On Subscriber: Create the same table tab1.

   psql -d postgres -p $port_subscriber -c "CREATE TABLE tab1(a int);"

4. On Subscriber: Create the subscription with copy_data set to false.

   psql -d postgres -p $port_subscriber -c "CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=$port_primary' PUBLICATION pub WITH (slot_name='logicalslot', create_slot=true, copy_data = false, enabled=true)"

5. On Primary: Confirm the slot details.

   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

6. Insert data into tab1. The apply worker's origin_lsn and the slot's
   confirmed_flush will be advanced to the LSN of this INSERT (say lsn1).

   psql -d postgres -p $port_primary -c "INSERT INTO tab1 VALUES(2);"

7. Check both confirmed_flush and origin_lsn. Both values should now match
   the LSN of the insert above (lsn1).

   psql -d postgres -p $port_subscriber -c "select * from pg_replication_origin_status where local_id = 1;"
   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

8. Add a new table (tab2) to the publication on Primary.

   psql -d postgres -p $port_primary -c "CREATE TABLE tab2 (a int UNIQUE); ALTER PUBLICATION pub ADD TABLE tab2;"

9. Create tab2 on Subscriber.

   psql -d postgres -p $port_subscriber -c "CREATE TABLE tab2 (a int UNIQUE);"

10. Refresh the subscription. It will start tablesync for tab2.

   psql -d postgres -p $port_subscriber -c "ALTER SUBSCRIPTION sub REFRESH PUBLICATION"

11. Attach a debugger to the tablesync worker and hold it just before it sets
    the state to SUBREL_STATE_SYNCWAIT.

12. On Primary: Insert a row into tab2. Let's say the remote LSN for this
    change is lsn2.

   psql -d postgres -p $port_primary -c "INSERT INTO tab2 VALUES(2);"

13. Wait for 3+ seconds. The above insert will not be consumed by the
    tablesync worker on the subscriber yet. The apply worker will see this
    change and will ignore it.

14. Check that confirmed_flush has now moved to lsn2 (where lsn2 > lsn1) due
    to keepalive message handling in the apply worker, and that origin_lsn
    remains unchanged.

   psql -d postgres -p $port_subscriber -c "select * from pg_replication_origin_status where local_id = 1;"
   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

15. Attach another debugger to the apply worker process and hold it just
    before the maybe_reread_subscription() call.

16. Release the tablesync worker from the debugger.
    - It will now move to the SUBREL_STATE_SYNCWAIT state and will wait for
      the apply worker to move the state to SUBREL_STATE_CATCHUP.

17. Disable the subscription.

   psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

18. Release the apply worker from the debugger.
    - It will re-read the subscription and move the state to
      SUBREL_STATE_CATCHUP. It will then exit because the subscription is
      disabled. Tablesync will also exit here.

19. Enable the subscription again and let the apply worker start.
    - Tablesync will now catch up (consume the data inserted above in tab2)
      and move the state to SUBREL_STATE_SYNCDONE.
    - Then the apply worker will move the state to SUBREL_STATE_READY.

   psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

-----------
Wait here for 3+ seconds. The state is now:
-- table sync is finished on the subscriber; changes are synced up to lsn2
-- the apply worker has processed and ignored the changes up to lsn2 without
   updating origin_lsn
-- the apply worker's origin_lsn on the subscriber is still lsn1
-- confirmed_flush on the publisher is at lsn2
-----------

20. Check the LSN values.

   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

21. Disable the subscription.

   psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

22. Re-enable the subscription, attach a debugger to the walsender (on
    Primary), and hold it just before ProcessRepliesIfAny(). This will stop it
    from processing any more replies or sending any more keepalives.

   psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

    -- Due to the lack of any message from the walsender, let the apply worker
       send feedback with the flush position as lsn1 (origin_lsn). This will
       only be processed by the walsender after we detach it from the
       debugger.

23. Check that the origin is still at lsn1 and confirmed_flush is at lsn2.

   psql -d postgres -p $port_subscriber -c "select * from pg_replication_origin_status where local_id = 1;"
   psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

24. Disable the subscription.

   psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

25. Detach the debugger from the walsender process.
    - Before exiting, it will process the reply from the apply worker and move
      confirmed_flush back to lsn1.

26. Check that confirmed_flush has now moved to lsn1.

   psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

27. Enable the subscription.
    - Now the walsender will start streaming from lsn1. This will result in a
      replay of 'INSERT to tab2 (lsn2)' and data duplication in tab2.

   psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

-----------------------------------
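
Several of the checks above (steps 7, 14, 23 and 26) come down to comparing two LSNs by eye. A minimal sketch of a helper for that comparison follows. It assumes the usual 'HI/LO' hexadecimal text form of an LSN; the function names lsn_to_dec and lsn_ahead are made up for illustration and are not part of the reproduction script.

```shell
#!/bin/sh
# Hypothetical helpers for comparing LSNs while following the steps.

# Convert an LSN of the form 'HI/LO' (both parts hexadecimal) to a decimal
# byte position: HI * 2^32 + LO.
lsn_to_dec() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Succeed (exit status 0) when the first LSN is strictly ahead of the second.
lsn_ahead() {
    [ "$(lsn_to_dec "$1")" -gt "$(lsn_to_dec "$2")" ]
}

# Example (commented out because it needs running servers):
#   confirmed=$(psql -At -d postgres -p "$port_primary" -c "SELECT confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'")
#   origin=$(psql -At -d postgres -p "$port_subscriber" -c "SELECT remote_lsn FROM pg_replication_origin_status WHERE local_id = 1")
#   lsn_ahead "$confirmed" "$origin" && echo "confirmed_flush is ahead of origin_lsn"
```

At step 23 the example is expected to report that confirmed_flush is ahead of origin_lsn; at step 26 the two values should be equal, so lsn_ahead should fail in both directions.
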
Below are the detailed steps to reproduce the issue on PG13:

(Note: Since these steps are intended to be run manually, short delays such as
"sleep 1" between steps are assumed and not explicitly mentioned. Any wait
longer than one second is explicitly called out.)

--------------------

1. Set up the primary and subscriber nodes with the same configuration as in
   reproduce_data_duplicate_without_twophase.sh. (The script can be used to do
   the initial setup.)

2. On Primary: Create table tab1, insert a value, and create a publication.

   psql -d postgres -p $port_primary -c "CREATE TABLE tab1(a int); INSERT INTO tab1 VALUES(1); CREATE PUBLICATION pub FOR TABLE tab1;"

3. On Subscriber: Create the same table tab1.

   psql -d postgres -p $port_subscriber -c "CREATE TABLE tab1(a int);"

4. On Subscriber: Create the subscription with copy_data set to false.

   psql -d postgres -p $port_subscriber -c "CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=$port_primary' PUBLICATION pub WITH (slot_name='logicalslot', create_slot=true, copy_data = false, enabled=true)"

5. On Primary: Confirm the slot details.

   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

6. Insert data into tab1. The apply worker's origin_lsn and the slot's
   confirmed_flush will be advanced to the LSN of this INSERT (say lsn1).

   psql -d postgres -p $port_primary -c "INSERT INTO tab1 VALUES(2);"

7. Check both confirmed_flush and origin_lsn. Both values should now match
   the LSN of the insert above (lsn1).

   psql -d postgres -p $port_subscriber -c "select * from pg_replication_origin_status where local_id = 1;"
   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

8. Add a new table (tab2) to the publication on Primary.

   psql -d postgres -p $port_primary -c "CREATE TABLE tab2 (a int UNIQUE); ALTER PUBLICATION pub ADD TABLE tab2;"

9. Create tab2 on Subscriber.

   psql -d postgres -p $port_subscriber -c "CREATE TABLE tab2 (a int UNIQUE);"

10. Refresh the subscription. It will start tablesync for tab2.

   psql -d postgres -p $port_subscriber -c "ALTER SUBSCRIPTION sub REFRESH PUBLICATION"

11. Attach a debugger to the tablesync worker and hold it just before it sets
    the state to SUBREL_STATE_SYNCWAIT.

12. On Primary: Insert a row into tab2. Let's say the remote LSN for this
    change is lsn2.

   psql -d postgres -p $port_primary -c "INSERT INTO tab2 VALUES(2);"

13. Wait for 3+ seconds. The above insert will not be consumed by the
    tablesync worker on the subscriber yet. The apply worker will see this
    change and will ignore it.

14. Check that confirmed_flush has now moved to lsn2 (where lsn2 > lsn1) due
    to keepalive message handling in the apply worker, and that origin_lsn
    remains unchanged.

   psql -d postgres -p $port_subscriber -c "select * from pg_replication_origin_status where local_id = 1;"
   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

15. Attach another debugger to the apply worker process and hold it just
    after the maybe_reread_subscription() call but before
    process_syncing_tables().

16. At the tablesync debugger: release the worker's hold and let it continue.
    It will wait for the apply worker to move the state to
    SUBREL_STATE_CATCHUP before proceeding.

17. At the apply worker debugger: release the hold and let it finish
    process_syncing_tables(); it will move the state to SUBREL_STATE_CATCHUP.
    BUT hold it again just after process_syncing_tables() finishes.

18. Concurrently, before the tablesync catches up, attach the debugger to the
    tablesync worker again, wait for the process_syncing_tables_for_sync()
    call to be hit, and hold it just before it sets the state to
    SUBREL_STATE_SYNCDONE. The tablesync will now catch up (consume the data
    inserted above in tab2) and should wait in the debugger just before
    setting the state to SUBREL_STATE_SYNCDONE.

19. Release the tablesync worker from the debugger.
    - It will now move to the SUBREL_STATE_SYNCDONE state.
    Note: the apply worker is on hold just after process_syncing_tables(), so
    it will not move the state to READY yet.

20. Disable the subscription.

   psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

21. Release the apply worker from the debugger.
    - It will exit because the subscription is disabled. Tablesync will also
      exit here.

22. Enable the subscription again and let the apply worker start.
    - The apply worker will move the state to SUBREL_STATE_READY.

   psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

-----------
Wait here for 3+ seconds. The state is now:
-- table sync is finished on the subscriber; changes are synced up to lsn2
-- the apply worker has processed and ignored the changes up to lsn2 without
   updating origin_lsn
-- the apply worker's origin_lsn on the subscriber is still lsn1
-- confirmed_flush on the publisher is at lsn2
-----------

23. Check the LSN values.

   psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

24. Disable the subscription.

   psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

25. Re-enable the subscription, attach a debugger to the walsender (on
    Primary), and hold it just before ProcessRepliesIfAny(). This will stop it
    from processing any more replies or sending any more keepalives.

   psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

    -- Due to the lack of any message from the walsender, let the apply worker
       send feedback with the flush position as lsn1 (origin_lsn). This will
       only be processed by the walsender after we detach it from the
       debugger.

26. Check that the origin is still at lsn1 and confirmed_flush is at lsn2.

   psql -d postgres -p $port_subscriber -c "select * from pg_replication_origin_status where local_id = 1;"
   psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

27. Disable the subscription.

   psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

28. Detach the debugger from the walsender process.
    - Before exiting, it will process the reply from the apply worker and move
      confirmed_flush back to lsn1.

29. Check that confirmed_flush has now moved to lsn1.

   psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

30. Enable the subscription.
    - Now the walsender will start streaming from lsn1. This will result in a
      replay of 'INSERT to tab2 (lsn2)' and data duplication in tab2.

   psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

-----------------------------------
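
The steps refer to relation sync states by their SUBREL_STATE_* names, while the catalog stores single-character codes. A small lookup sketch for translating between the two follows. The function name subrel_state_name is made up for illustration; the codes are the ones defined in catalog/pg_subscription_rel.h.

```shell
#!/bin/sh
# Hypothetical helper: map a relation state code to its SUBREL_STATE_* name.
subrel_state_name() {
    case "$1" in
        i) echo SUBREL_STATE_INIT ;;
        d) echo SUBREL_STATE_DATASYNC ;;
        f) echo SUBREL_STATE_FINISHEDCOPY ;;  # PG14 and later only
        w) echo SUBREL_STATE_SYNCWAIT ;;      # used only for IPC, never stored in the catalog
        c) echo SUBREL_STATE_CATCHUP ;;       # used only for IPC, never stored in the catalog
        s) echo SUBREL_STATE_SYNCDONE ;;
        r) echo SUBREL_STATE_READY ;;
        *) echo UNKNOWN; return 1 ;;
    esac
}

# Example (commented out because it needs a running subscriber):
#   state=$(psql -At -d postgres -p "$port_subscriber" -c "SELECT srsubstate FROM pg_subscription_rel WHERE srrelid = 'tab2'::regclass")
#   subrel_state_name "$state"
```

Because SYNCWAIT and CATCHUP are never written to pg_subscription_rel, querying the catalog cannot confirm those two transitions; they can only be observed from the debugger. SYNCDONE ('s') and READY ('r') are visible in the catalog.
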