Hi Vignesh,

> In the case of physical replication by setting
> recovery_min_apply_delay, I noticed that both primary and standby
> nodes were getting stopped successfully immediately after the stop
> server command. In case of logical replication, stop server fails:
> pg_ctl -D publisher -l publisher.log stop -c
> waiting for server to shut
> down...............................................................
> failed
> pg_ctl: server does not shut down
>
> In case of logical replication, the server does not get stopped
> because the walsender process is not able to exit:
> ps ux | grep walsender
> vignesh 1950789 75.3 0.0 8695216 22284 ? Rs 11:51 1:08
> postgres: walsender vignesh [local] START_REPLICATION

Thanks for reporting the issue. I have analyzed it.

The issue occurs because the apply worker cannot reply during the
delay. I think we may have to modify the mechanism that delays
applying transactions.

When walsender processes are requested to shut down, they can exit
only after all of the WAL they have sent has been replicated on the
subscriber. This check is done in WalSndDone(), and the replicated
position is updated when the walsender handles a reply message from
the subscriber, in ProcessStandbyReplyMessage().

In the case of physical replication, the walreceiver can receive WAL
and reply even while applying is delayed. This means that the
replicated position is reported to the primary immediately, so the
walsender can exit.

In the case of logical replication, however, with the current patch
the worker cannot reply to the walsender while it is delaying a
transaction. As a result, the replicated position is never reported
upstream and the walsender cannot exit.

Based on the above analysis, we can conclude that the worker must
update the flushpos and reply to the walsender while it is delaying
the transaction if we want to solve the issue. This cannot be done
with the current approach. A newer proposed approach [1] may be able
to solve this, although it is still under discussion.

Note that a similar issue can be reproduced with physical replication.
When wal_sender_timeout is set to 0 and the network between the
primary and the secondary breaks after the primary has sent WAL to the
secondary, we cannot stop the primary node.

[1]: https://www.postgresql.org/message-id/TYCPR01MB8373FA10EB2DB2BF8E458604ED1B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
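
P.S. To make the shutdown condition easier to follow, here is a small toy model in Python. It is only a sketch of the reasoning in this mail, not PostgreSQL code: the class ToyWalSender, its fields, and the simulate() helper are hypothetical names, WAL positions are plain integers, and the real check in WalSndDone() involves more state than the single comparison shown here.

```python
from dataclasses import dataclass


@dataclass
class ToyWalSender:
    """Hypothetical, simplified stand-in for a walsender's shutdown state."""
    sent_ptr: int = 0        # last WAL position sent downstream
    replicated_ptr: int = 0  # last position the receiver reported back

    def send(self, upto: int) -> None:
        # Stream WAL up to the given position.
        self.sent_ptr = max(self.sent_ptr, upto)

    def process_reply(self, flush_pos: int) -> None:
        # Models handling a reply message from the downstream node.
        self.replicated_ptr = max(self.replicated_ptr, flush_pos)

    def can_exit(self) -> bool:
        # Simplified shutdown condition: everything sent is replicated.
        return self.sent_ptr == self.replicated_ptr


def simulate(receiver_replies_during_delay: bool) -> bool:
    """Return True if the toy walsender could exit on shutdown."""
    sender = ToyWalSender()
    sender.send(100)
    if receiver_replies_during_delay:
        # Physical case: the receiver reports its position even while
        # applying is delayed.
        sender.process_reply(100)
    # Logical case with the current patch: no reply arrives during the
    # delay, so replicated_ptr stays behind sent_ptr.
    return sender.can_exit()


if __name__ == "__main__":
    print("physical, delayed apply:", simulate(True))
    print("logical, delayed apply: ", simulate(False))
```

In this model the walsender can exit only in the first case, which matches the behavior reported above: the shutdown hangs exactly when no reply reaches the sender during the delay.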