On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao <[email protected]> wrote:
>
> On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <[email protected]> wrote:
> > Hi Fujii-san,
> >
> > I tried reproducing the wait scenario as you mentioned, but could not
> > reproduce it. Steps I followed:
> > 1) Place a debugger in the slotsync worker and hold it at
> >    fetch_remote_slots() ... -> libpqsrv_get_result()
> > 2) Kill the primary.
> > 3) Trigger promotion of the standby and release the debugger from the
> >    slotsync worker.
> >
> > The slot sync worker stops when the promotion is triggered and then
> > restarts, but fails to connect to the primary. The promotion happens
> > immediately.
> > ```
> > LOG: received promote request
> > LOG: redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> > 0.02 s, elapsed: 4574.89 s
> > LOG: last completed transaction was at log time 2026-03-23
> > 17:13:15.782313+05:30
> > LOG: replication slot synchronization worker will stop because
> > promotion is triggered
> > LOG: slot sync worker started
> > ERROR: synchronization worker "slotsync worker" could not connect to
> > the primary server: connection to server at "127.0.0.1", port 9933
> > failed: Connection refused
> > Is the server running on that host and accepting TCP/IP connections?
> > ```
> >
> > I'll debug this further to understand it better.
> > In the meantime, please let me know if I'm missing any step, or if you
> > followed a specific setup/script to reproduce this scenario.
>
> Thanks for testing!
>
> If you killed the primary with a signal like SIGTERM, an RST packet might
> have been sent to the slotsync worker at that moment. That allowed the
> worker to detect the connection loss and exit the wait state, so promotion
> could complete as expected.
>
> To reproduce the issue, you'll need a scenario where the worker cannot
> detect the connection loss. For example, you could block network traffic
> (e.g., with iptables) between the primary and the slotsync worker.
> The key is to create a situation where the worker remains stuck waiting
> for input for a long time.
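Incidentally, while reproducing this you can confirm that the worker really is stuck in a wait rather than cycling through restarts. The snippet below is only a sketch, not part of the tested recipe: it assumes the standby listens on port 5433 and that the worker shows up in pg_stat_activity with backend_type 'slotsync worker', so adjust to your setup:

```shell
# Sketch (assumptions: standby on port 5433; the slotsync worker is
# reported in pg_stat_activity with backend_type 'slotsync worker').
# While traffic is blocked, this shows what the worker is waiting on.
psql -p 5433 -x -c "
  SELECT pid, backend_type, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE backend_type = 'slotsync worker';"
```

If the worker cannot detect the connection loss, the same wait event keeps showing up across the promotion attempt.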
Here's one way to reproduce the issue using iptables:

----------------------------------------------------
[Set up slot synchronization environment]
initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot
pg_recvlogical --create-slot -S logical_slot -P pgoutput --enable-failover -d postgres
psql -c "CREATE PUBLICATION mypub"
pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres" -h 127.0.0.1
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
EOF
pg_ctl -D sby1 start
psql -c "SELECT pg_logical_emit_message(true, 'abc', 'xyz')"

[Block network traffic used by slot synchronization]
su -
iptables -A INPUT -p tcp --sport 5432 -j DROP
iptables -A OUTPUT -p tcp --dport 5432 -j DROP

[Promote the standby]
# wait a few seconds
pg_ctl -D sby1 promote
----------------------------------------------------

In my tests on master, promotion got stuck in this scenario. With the patch,
promotion completed promptly.

After testing, you can remove the network block with:

iptables -D INPUT -p tcp --sport 5432 -j DROP
iptables -D OUTPUT -p tcp --dport 5432 -j DROP

Regards,

-- 
Fujii Masao
