Dear Dilip, hackers, Thanks for giving your opinion. I analyzed the relation with the given commit, and I thought I could keep my patch. How do you think?
# Abstract * Some modifications should be needed. * We cannot rollback the shutdown if walsenders are stuck * We don't have a good way to detect stuck # Discussion Compared to physical replication, it is likely to happen that logical replication is stuck. I think the risk should be avoided as much as possible by fixing codes. Then, if it leads another failure, we can document the caution to users. While shutting down the server, checkpointer sends SIGUSR1 signal to wansenders. It is done after being exited other processes, so we cannot raise ERROR and rollback the operation if checkpointer recognize the process stuck at that time. We don't have any features that postmaster can check whether this node is publisher or not. So if we want to add the mechanism that can check the health of walsenders before shutting down, we must do that at the top of process_pm_shutdown_request() even if we are not in logical replication. I think it affects the basis of postgres largely, and in the first place, PostgreSQL does not have a mechanism to check the health of process. Therefore, I want to adopt the approach that walsender itself exits immediately when they get signals. ## About patch - Were fixes correct? In ProcessPendingWrites(), my patch, wansender calls WalSndDone() when it gets SIGUSR1 signal. I think this should be. From the patch [1]: ``` @@ -1450,6 +1450,10 @@ ProcessPendingWrites(void) /* Try to flush pending output to the client */ if (pq_flush_if_writable() != 0) WalSndShutdown(); + + /* If we got shut down requested, try to exit the process */ + if (got_STOPPING) + WalSndDone(XLogSendLogical); } /* reactivate latch so WalSndLoop knows to continue */ ``` Per my analysis, in case of logical replication, walsenders exit with following steps. Note that logical walsender does not receive SIGUSR2 signal, set flag by themselves instead: 1. postmaster sends shutdown requests to checkpointer 2. checkpointer sends SIGUSR1 to walsenders and wait 3. when walsenders accept SIGUSR1, they turn got_SIGUSR1 on. 4. walsenders consume all WALs. @XLogSendLogical 5. walsenders turn got_SIGUSR2 on by themselves @XLogSendLogical 6. walsenders recognize the flag is on, so call WalSndDone() @ WalSndLoop 7. proc_exit(0) 8. checkpoitner writes shutdown record ... Type (i) stuck, I reported in -hackers[1], means that processes stop at step 6 and Type (ii) stuck means that processes stop at 4. In step4, got_SIGUSR2 is never set to on, so we must use got_STOPPING flag. [1]: https://www.postgresql.org/message-id/tycpr01mb58701a47f35fed0a2b399662f5...@tycpr01mb5870.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED