On Mon, Sep 25, 2023 at 7:46 PM Hayato Kuroda (Fujitsu) <kuroda.hay...@fujitsu.com> wrote: > > Dear Ajin, Shveta, > > Thank you for rebasing the patch set! Here are new comments for v19_2-0001. >
Thank You Kuroda-san for the feedback. Most of these are addressed in v20. Please find my response inline. > 01. WalSndWaitForStandbyNeeded() > > ``` > if (SlotIsPhysical(MyReplicationSlot)) > return false; > ``` > > Is there a possibility that physical walsenders call this function? > IIUC following is a stacktrace for the function, so the only logical > walsenders use it. > If so, it should be Assert() instead of an if statement. > > logical_read_xlog_page() > WalSndWaitForWal() > WalSndWaitForStandbyNeeded() It will only be called from logical-walsenders. Modified as you suggested. > > 02. WalSndWaitForStandbyNeeded() > > Can we set shouldwait in SlotSyncInitConfig()? synchronize_slot_names_list is > searched whenever the function is called, but it is not changed automatically. > If the slotname is compared with the list in the SlotSyncInitConfig(), the > liner search can be reduced. standby_slot_names and synchronize_slot_names will be removed in the next version as per discussion in [1] and thus SlotSyncInitConfig() will not be needed. It will be replaced by new functionality. So I am currently leaving it as is. > > 03. WalSndWaitForStandbyConfirmation() > > We should add ProcessRepliesIfAny() during the loop, otherwise the walsender > overlooks the death of an apply worker. > Done. > 04. WalSndWaitForStandbyConfirmation() > > Not sure, but do we have to return early if walsenders got > PROCSIG_WALSND_INIT_STOPPING > signal? I thought that if physical walsenders get stuck, logical walsenders > wait > forever. At that time we cannot stop the primary server even if "pg_ctl stop" > is executed. > yes, right. I have added CHECK_FOR_INTERRUPTS() and 'got_STOPPING' handling now which I think should suffice to process PROCSIG_WALSND_INIT_STOPPING. > 05. SlotSyncInitConfig() > > Why don't we free the memory for rawname, old standby_slot_names_list, and > synchronize_slot_names_list? > They seem to be overwritten. > Skipped for the time being due to reasons stated in pt 2. > 06. SlotSyncInitConfig() > > Both physical and logical walsenders call the func, but physical one do not > use > lists, right? If so, can we add a quick exit for physical walsenders? > Or, we should carefully remove where physical calls it. > > 07. StartReplication() > > I think we do not have to call SlotSyncInitConfig(). > Alternative approach is written in above. > I have removed it from StartReplication() > 08. the other > > Also, I found the unexpected behavior after both 0001 and 0002 were applied. > Was it normal or not? > > 1. constructed below setup > (ensured that logical slot existed on secondary) > 2. stopped the primary > 3. promoted the secondary server > 4. disabled a subscription once > 5. changed the connection string for subscriber > 6. Inserted data to new primary > 7. enabled the subscription again > 8. got an ERROR: replication slot "sub" does not exist > > I expected that the logical replication would be restarted, but it could not. > Was it real issue or my fault? The error would appear in secondary.log. > > ``` > Setup: > primary--->secondary > | > | > subscriber > ``` I have attached the new test-script (v2), can you please try that on the v20 set of patches. We should let the slot creation complete first on standby and then try promotion. I have added a few extra lines in v2 of your script for the same. In the test-case, primary's restart-lsn was lagging behind new-slot's restart_lsn on standby and thus standby was waiting for primary to catch-up. Meanwhile standby got promoted and thus slot creation got aborted. That is the reason you were not able to get the logical replication working on the new primary. v20 has improved handling and better logging for such a case. Please try the attached test-script on v20. [1]: https://www.postgresql.org/message-id/CAJpy0uA%2Bt3XP2M0qtEmrOG1gSwHghjHPno5AtwTXM-94-%2Bc6JQ%40mail.gmail.com thanks Shveta
#!/bin/bash port_primary=5431 port_secondary=5432 port_subscriber=5433 echo '==========' echo '=Clean up=' echo '==========' pg_ctl stop -D data_primary pg_ctl stop -D data_secondary pg_ctl stop -D data_subscriber rm -rf data_* *log echo '=======================' echo '=Set up primary server=' echo '=======================' initdb -D data_primary -U postgres cat << EOF >> data_primary/postgresql.conf wal_level = logical port = $port_primary standby_slot_names = 'physical' synchronize_slot_names = 'sub' log_replication_commands = 'on' EOF cat << EOF >> data_primary/pg_hba.conf host all,replication all 0.0.0.0/0 trust EOF pg_ctl -D data_primary start -w -l primary.log psql -U postgres -p $port_primary -c "CREATE TABLE tbl (id int primary key);" psql -U postgres -p $port_primary -c "INSERT INTO tbl VALUES (1)" psql -U postgres -p $port_primary -c "CREATE PUBLICATION pub FOR ALL TABLES" psql -U postgres -p $port_primary -c "SELECT * FROM pg_create_physical_replication_slot('physical');" echo '=========================' echo '=Set up secondary server=' echo '=========================' pg_basebackup -D data_secondary -U postgres -p $port_primary cat << EOF >> data_secondary/postgresql.conf port = $port_secondary primary_conninfo = 'user=postgres port=$port_primary application_name=secondary dbname=postgres' primary_slot_name = 'physical' hot_standby_feedback=on EOF cat << EOF >> data_secondary/standby.signal EOF pg_ctl -D data_secondary start -w -l secondary.log psql -U postgres -p $port_primary -c "SELECT pg_log_standby_snapshot();" psql -U postgres -p $port_secondary -c "SELECT * FROM pg_create_physical_replication_slot('physical');" echo '===================' echo '=Set up subscirber=' echo '===================' initdb -U postgres -D data_subscriber cat << EOF >> data_subscriber/postgresql.conf port = $port_subscriber wal_level = logical EOF pg_ctl start -D data_subscriber -l subscriber.log psql -U postgres -p $port_subscriber -c "CREATE TABLE tbl (id int primary key);" psql -U postgres -p $port_subscriber -c "CREATE SUBSCRIPTION sub CONNECTION 'user=postgres dbname=postgres port=$port_primary' PUBLICATION pub WITH (copy_data = on)" sleep 1s pg_ctl -D data_secondary reload sleep 1s psql -U postgres -p $port_secondary -c "SELECT * FROM pg_replication_slots;" sleep 2s echo 'Moving lsns on primary so that slot-creation on secondary is successful' psql -U postgres -p $port_primary -c "INSERT INTO tbl VALUES (2)" psql -U postgres -p $port_primary -c "INSERT INTO tbl VALUES (3)" psql -U postgres -p $port_primary -c "INSERT INTO tbl VALUES (4)" #Allow sometime for successful slot creation on standby sleep 10s # Now promote... pg_ctl -D data_primary stop -w pg_ctl -D data_secondary promote -w psql -U postgres -p $port_subscriber -c "ALTER SUBSCRIPTION sub DISABLE" pg_ctl -D data_subscriber restart -l subscriber.log psql -U postgres -p $port_secondary -c "INSERT INTO tbl VALUES (generate_series(21, 30))" psql -U postgres -p $port_subscriber -c "ALTER SUBSCRIPTION sub CONNECTION 'user=postgres dbname=postgres port=$port_secondary'" psql -U postgres -p $port_subscriber -c "ALTER SUBSCRIPTION sub ENABLE"