Hi,

On Tue, May 13, 2025 at 3:48 PM shveta malik <shveta.ma...@gmail.com> wrote:
>
>
> With the given script, the problem reproduces on Head and PG17. We are
> trying to reproduce the issue on PG16 and below where injection points
> are not there.
>

The issue can also be reproduced on PostgreSQL versions 13 through 16.

The same steps shared earlier in the
'reproduce_data_duplicate_without_twophase.sh' script can be used to
reproduce the issue on versions PG14 to PG16.

Since the back branches do not support injection points, you can instead add
infinite loops at the locations where the patch
'v1-0001-Injection-points-to-reproduce-the-confirmed_flush.patch' introduces
injection points. These loops let you hold and release processes with a
debugger when needed.
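
To attach a debugger you first need the PID of the worker you want to hold.
Here is a minimal sketch, assuming the subscriber's pg_stat_subscription view
is queried with psql -At (so rows arrive as 'pid|relid'), and that a row with
an empty relid is the apply worker while a row with a non-empty relid is a
tablesync worker. The helper name pick_worker_pid is made up for illustration
and is not part of the attached script.

```shell
# Reads "pid|relid" rows (psql -At output) on stdin and prints matching PIDs.
#   pick_worker_pid apply      -> rows with an empty relid (apply worker)
#   pick_worker_pid tablesync  -> rows with a non-empty relid (tablesync worker)
# pick_worker_pid is a hypothetical helper, not part of PostgreSQL.
pick_worker_pid() {
    awk -F'|' -v kind="$1" '
        $1 == ""                        { next }      # no running worker
        kind == "apply"     && $2 == "" { print $1 }
        kind == "tablesync" && $2 != "" { print $1 }
    '
}

# Usage against the subscriber (port variable as in the reproduction steps):
# psql -d postgres -p "$port_subscriber" -Atc \
#   "SELECT pid, relid FROM pg_stat_subscription WHERE subname = 'sub'" |
#   pick_worker_pid tablesync
# The printed PID can then be passed to the debugger, e.g. gdb -p <pid>.
```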

Attached are detailed documents describing the reproduction steps:
 1) Use 'reproduce_steps_for_pg14_to_16.txt' for PG14 to PG16.
 2) Use 'reproduce_steps_for_pg13.txt' for PG13.

Note: PG13 uses temporary replication slots for tablesync workers, unlike
later versions that use permanent slots. Because of this difference, some
debugger-related steps differ slightly in PG13, which is why a separate
document is provided for it.

--
Thanks,
Nisha
Below are the detailed steps to reproduce the issue on PG14 to PG16 using a
debugger:
(Note: Since these steps are intended to be run manually, short delays like 
sleep 1 between steps are assumed and not explicitly mentioned. Any wait time 
longer than one second is explicitly called out.)
--------------------

1. Set up the primary and subscriber nodes with the same configurations as 
shared in reproduce_data_duplicate_without_twophase.sh. (The script can be used 
to do the initial setup)

2. On Primary: Create table tab1, insert a value and create a publication

psql -d postgres -p $port_primary -c "CREATE TABLE tab1(a int); INSERT INTO 
tab1 VALUES(1); CREATE PUBLICATION pub FOR TABLE tab1;"

3. On Subscriber: Create the same table tab1

psql -d postgres -p $port_subscriber -c "CREATE TABLE tab1(a int);"

4. On Subscriber: Create the subscription with copy_data set to false

psql -d postgres -p $port_subscriber -c "CREATE SUBSCRIPTION sub CONNECTION 
'dbname=postgres port=$port_primary' PUBLICATION pub WITH 
(slot_name='logicalslot', create_slot=true, copy_data = false, enabled=true)"

5. Primary: Confirm the slot details

psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

6. On Primary: Insert data into tab1. The apply worker's origin_lsn and the
slot's confirmed_flush will advance to the LSN of this INSERT (say lsn1)

psql -d postgres -p $port_primary -c "INSERT INTO tab1 VALUES(2);"

7. Check the confirmed_flush and origin_lsn values. Both should now match the
LSN of the insert above (lsn1).
psql -d postgres -p $port_subscriber -c "select * from 
pg_replication_origin_status where local_id = 1;"
psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

8. Add a new table (tab2) to the publication on Primary
psql -d postgres -p $port_primary -c "CREATE TABLE tab2 (a int UNIQUE); ALTER 
PUBLICATION pub ADD TABLE tab2;"

9. Create tab2 on Subscriber
psql -d postgres -p $port_subscriber -c "CREATE TABLE tab2 (a int UNIQUE);"

10. Refresh the subscription. It will start tablesync for tab2.

psql -d postgres -p $port_subscriber -c "ALTER SUBSCRIPTION sub REFRESH 
PUBLICATION"

11. Attach debugger to the tablesync worker and hold it just before it sets the 
state to SUBREL_STATE_SYNCWAIT.


12. On Primary: Insert a row into tab2. Let's say the remote LSN for this
change is lsn2.

psql -d postgres -p $port_primary -c "INSERT INTO tab2 VALUES(2);"

13. Wait for 3+ seconds. The above insert will not be consumed by the tablesync
worker on the subscriber yet. The apply worker will see this change and ignore
it.


14. Check that confirmed_flush has now moved to lsn2 (where lsn2 > lsn1) due to
keepalive message handling in the apply worker, while origin_lsn remains
unchanged.

psql -d postgres -p $port_subscriber -c "select * from 
pg_replication_origin_status where local_id = 1;"

psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"


15. Attach another debugger to the apply worker process and hold it just before
the maybe_reread_subscription() call.

16. Release the tablesync worker from the debugger
 - It will now move to the SUBREL_STATE_SYNCWAIT state and wait for the apply
worker to move the state to SUBREL_STATE_CATCHUP.

17. Disable the sub
psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

18. Release the apply worker from the debugger.
 - It will re-read the subscription and move the state to SUBREL_STATE_CATCHUP.
It will then exit because the subscription is disabled. Tablesync will also
exit here.

19. Enable the subscription again and let the apply worker start.
 - Tablesync will now catch up (consume the data inserted above in tab2) and
move the state to SUBREL_STATE_SYNCDONE.
 - Then the apply worker will move the state to SUBREL_STATE_READY.

psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

-----------
Wait here for 3+ seconds. The state is now:
 --table sync is finished on sub, changes are synced up to lsn2
 --apply worker has processed and ignored the changes up to lsn2 without
updating origin_lsn
 --apply worker's origin_lsn at sub is still lsn1
 --confirmed_flush on pub is at lsn2
-----------

20. Check the lsn values

psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

21. Disable sub

psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

22. Re-enable the sub, attach a debugger to the walsender (Primary), and hold
it just before ProcessRepliesIfAny(). This will stop it from processing any
more replies or sending any more keepalives.

psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

 -- Since the walsender sends no messages, let the apply worker send feedback
with the flush position as lsn1 (origin_lsn). The walsender will process this
only after we detach it from the debugger.


23. Check origin is still at lsn1 and confirmed_flush at lsn2
psql -d postgres -p $port_subscriber -c "select * from 
pg_replication_origin_status where local_id = 1;"
psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn 
FROM pg_replication_slots WHERE slot_name='logicalslot'"

24. Disable the subscription
psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

25. Detach the debugger from the walsender process
- Before exiting, it will process the reply from the apply worker and move
confirmed_flush to lsn1.

26. Check that confirmed_flush has now moved to lsn1

psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn 
FROM pg_replication_slots WHERE slot_name='logicalslot'"

27. Enable the subscription.

 - Now the walsender will start streaming from lsn1. This will result in a
replay of 'INSERT to tab2 (lsn2)' and data duplication in tab2.

psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"
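
The LSN checks in steps 7, 14, 23 and 26 compare values by eye. As a rough aid,
the sketch below converts the textual pg_lsn form ('hi/lo' in hex) into an
integer so that two values can be compared in the shell. The helper names
lsn_to_int and lsn_lt are hypothetical, the sample LSNs are made up, and the
arithmetic assumes a shell with 64-bit integers and an LSN whose high part is
below 0x80000000, which should hold on a fresh test cluster.

```shell
# Convert a pg_lsn string such as 0/16B3748 to a decimal integer.
# The body runs in a subshell so hi/lo do not leak into the caller.
lsn_to_int() (
    hi=${1%%/*}     # hex digits before the slash (upper 32 bits)
    lo=${1##*/}     # hex digits after the slash (lower 32 bits)
    echo $(( (0x$hi << 32) + 0x$lo ))
)

# Succeeds when the first LSN is strictly lower than the second.
lsn_lt() {
    [ "$(lsn_to_int "$1")" -lt "$(lsn_to_int "$2")" ]
}

# Example with made-up values standing in for lsn1 and lsn2:
if lsn_lt 0/16B3748 0/16B3780; then
    echo "lsn1 < lsn2"
fi
```

The same comparison can also be done on the server by casting both values to
pg_lsn, for example SELECT '0/16B3748'::pg_lsn < '0/16B3780'::pg_lsn;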


-----------------------------------
Below are the detailed steps to reproduce the issue on PG13 using a debugger:
(Note: Since these steps are intended to be run manually, short delays like 
sleep 1 between steps are assumed and not explicitly mentioned. Any wait time 
longer than one second is explicitly called out.)
--------------------

1. Set up the primary and subscriber nodes with the same configurations as 
shared in reproduce_data_duplicate_without_twophase.sh. (The script can be used 
to do the initial setup)

2. On Primary: Create table tab1, insert a value and create a publication

psql -d postgres -p $port_primary -c "CREATE TABLE tab1(a int); INSERT INTO 
tab1 VALUES(1); CREATE PUBLICATION pub FOR TABLE tab1;"

3. On Subscriber: Create the same table tab1

psql -d postgres -p $port_subscriber -c "CREATE TABLE tab1(a int);"

4. On Subscriber: Create the subscription with copy_data set to false

psql -d postgres -p $port_subscriber -c "CREATE SUBSCRIPTION sub CONNECTION 
'dbname=postgres port=$port_primary' PUBLICATION pub WITH 
(slot_name='logicalslot', create_slot=true, copy_data = false, enabled=true)"

5. Primary: Confirm the slot details

psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

6. On Primary: Insert data into tab1. The apply worker's origin_lsn and the
slot's confirmed_flush will advance to the LSN of this INSERT (say lsn1)

psql -d postgres -p $port_primary -c "INSERT INTO tab1 VALUES(2);"

7. Check the confirmed_flush and origin_lsn values. Both should now match the
LSN of the insert above (lsn1).
psql -d postgres -p $port_subscriber -c "select * from 
pg_replication_origin_status where local_id = 1;"
psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

8. Add a new table (tab2) to the publication on Primary
psql -d postgres -p $port_primary -c "CREATE TABLE tab2 (a int UNIQUE); ALTER 
PUBLICATION pub ADD TABLE tab2;"

9. Create tab2 on Subscriber
psql -d postgres -p $port_subscriber -c "CREATE TABLE tab2 (a int UNIQUE);"

10. Refresh the subscription. It will start tablesync for tab2.

psql -d postgres -p $port_subscriber -c "ALTER SUBSCRIPTION sub REFRESH 
PUBLICATION"

11. Attach debugger to the tablesync worker and hold it just before it sets the 
state to SUBREL_STATE_SYNCWAIT.


12. On Primary: Insert a row into tab2. Let's say the remote LSN for this
change is lsn2.

psql -d postgres -p $port_primary -c "INSERT INTO tab2 VALUES(2);"

13. Wait for 3+ seconds. The above insert will not be consumed by the tablesync
worker on the subscriber yet. The apply worker will see this change and ignore
it.

14. Check that confirmed_flush has now moved to lsn2 (where lsn2 > lsn1) due to
keepalive message handling in the apply worker, while origin_lsn remains
unchanged.

psql -d postgres -p $port_subscriber -c "select * from 
pg_replication_origin_status where local_id = 1;"
psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"


15. Attach another debugger to the apply worker process and hold it just after
the maybe_reread_subscription() call but before process_syncing_tables().
 
16. At the tablesync debugger: release the hold and let the worker continue. It
will wait for the apply worker to move the state to SUBREL_STATE_CATCHUP before
proceeding.
 
17. At the apply worker debugger: release the hold and let it finish
process_syncing_tables(), which will move the state to SUBREL_STATE_CATCHUP.
Then hold it again just after process_syncing_tables() finishes.

18. Concurrently, before the tablesync worker catches up, attach the debugger
to the tablesync worker again, wait for the process_syncing_tables_for_sync()
call to be hit, and hold it just before it sets the state to
SUBREL_STATE_SYNCDONE.
The tablesync worker will now catch up (consume the data inserted above in
tab2) and should wait in the debugger just before setting the state to
SUBREL_STATE_SYNCDONE.
 
19. Release the tablesync worker from the debugger
- It will now move to the SUBREL_STATE_SYNCDONE state.

Note: the apply worker is on hold just after process_syncing_tables(), so it
will not move the state to READY yet.
 
20. Disable the sub
psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"
 
21. Release the apply worker from the debugger.
- It will exit because the subscription is disabled. Tablesync will also exit
here.
 
22. Enable the subscription again and let the apply worker start.
- The apply worker will move the state to SUBREL_STATE_READY.
 
psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

-----------
Wait here for 3+ seconds. The state is now:
 --table sync is finished on sub, changes are synced up to lsn2
 --apply worker has processed and ignored the changes up to lsn2 without
updating origin_lsn
 --apply worker's origin_lsn at sub is still lsn1
 --confirmed_flush on pub is at lsn2
-----------

23. Check the lsn values

psql -d postgres -p $port_primary -c "SELECT slot_name, restart_lsn, 
confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name='logicalslot'"

24. Disable sub

psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

25. Re-enable the sub, attach a debugger to the walsender (Primary), and hold
it just before ProcessRepliesIfAny(). This will stop it from processing any
more replies or sending any more keepalives.

psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"

 -- Since the walsender sends no messages, let the apply worker send feedback
with the flush position as lsn1 (origin_lsn). The walsender will process this
only after we detach it from the debugger.


26. Check origin is still at lsn1 and confirmed_flush at lsn2
psql -d postgres -p $port_subscriber -c "select * from 
pg_replication_origin_status where local_id = 1;"
psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn 
FROM pg_replication_slots WHERE slot_name='logicalslot'"

27. Disable the subscription
psql -d postgres -p $port_subscriber -c "alter subscription sub disable;"

28. Detach the debugger from the walsender process
- Before exiting, it will process the reply from the apply worker and move
confirmed_flush to lsn1.

29. Check that confirmed_flush has now moved to lsn1

psql -d postgres -p $port_primary -c "SELECT slot_name, confirmed_flush_lsn 
FROM pg_replication_slots WHERE slot_name='logicalslot'"

30. Enable the subscription.

 - Now the walsender will start streaming from lsn1. This will result in a
replay of 'INSERT to tab2 (lsn2)' and data duplication in tab2.

psql -d postgres -p $port_subscriber -c "alter subscription sub enable;"
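
Because tab2 is declared with a UNIQUE column, the replayed INSERT would be
expected to surface as a unique-violation error in the subscriber's server log
rather than as a second visible row. Below is a minimal sketch for checking
that, assuming the default constraint name tab2_a_key and a log file path that
you supply yourself. The helper name has_tab2_conflict and the path
subscriber.log are placeholders for illustration only.

```shell
# Succeeds if the given log file contains the unique-violation error that a
# replayed INSERT into tab2 would be expected to raise on the subscriber.
# has_tab2_conflict is a hypothetical helper; tab2_a_key is the default name
# PostgreSQL gives to the constraint created by "a int UNIQUE" on tab2.
has_tab2_conflict() {
    grep -q 'duplicate key value violates unique constraint "tab2_a_key"' "$1"
}

# Usage (subscriber.log is a placeholder for the subscriber's log file):
# if has_tab2_conflict subscriber.log; then
#     echo "replayed INSERT hit the unique constraint on tab2"
# fi
```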


-----------------------------------
