RE: Time delayed LR (WAS Re: logical replication restrictions)

Hayato Kuroda (Fujitsu) Thu, 08 Dec 2022 21:20:01 -0800

Hi Vignesh,

> In the case of physical replication by setting
> recovery_min_apply_delay, I noticed that both primary and standby
> nodes were getting stopped successfully immediately after the stop
> server command. In case of logical replication, stop server fails:
> pg_ctl -D publisher -l publisher.log stop -c
> waiting for server to shut
> down...............................................................
> failed
> pg_ctl: server does not shut down
> 
> In case of logical replication, the server does not get stopped
> because the walsender process is not able to exit:
> ps ux | grep walsender
> vignesh  1950789 75.3  0.0 8695216 22284 ?       Rs   11:51   1:08
> postgres: walsender vignesh [local] START_REPLICATION

Thanks for reporting the issue. I have analyzed it.

This issue occurs because the apply worker cannot reply to the walsender during
the delay. I think we may have to modify the mechanism that delays applying
transactions.

When a walsender process is requested to shut down, it can exit only after all
the WAL it has sent has been replicated on the subscriber. This check is done in
WalSndDone(), and the replicated position is updated when the walsender handles
reply messages from the subscriber, in ProcessStandbyReplyMessage().
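
To make the condition concrete, here is a minimal standalone sketch of that
check. It is a simplified model, not the actual walsender code: SenderState,
handle_reply() and can_exit() are hypothetical names, and Lsn stands in for
XLogRecPtr. It assumes the replicated position is the reported flush position,
falling back to the write position when no flush position has been reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* hypothetical stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)

/* Hypothetical, simplified view of what the sender tracks. */
typedef struct SenderState
{
    Lsn  sent;          /* last position sent downstream */
    Lsn  write;         /* last write position reported by the receiver */
    Lsn  flush;         /* last flush position reported by the receiver */
    bool caught_up;     /* nothing left to send */
    bool send_pending;  /* unsent bytes still in the output buffer */
} SenderState;

/* Models handling of a reply message: only here do write/flush advance. */
static inline void
handle_reply(SenderState *s, Lsn write, Lsn flush)
{
    assert(s != NULL);
    s->write = write;
    s->flush = flush;
}

/* Models the shutdown check: exit only once everything sent is confirmed. */
static inline bool
can_exit(const SenderState *s)
{
    Lsn replicated;

    assert(s != NULL);
    replicated = (s->flush == INVALID_LSN) ? s->write : s->flush;
    return s->caught_up && s->sent == replicated && !s->send_pending;
}
```

In this model, if handle_reply() is never called after the last send,
can_exit() stays false indefinitely, which matches the hang reported above.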

In the case of physical replication, the walreceiver can receive WAL and reply
even while the apply is delayed. This means that the replicated position is
reported to the sending side immediately, so the walsender can exit.

In logical replication, however, with this patch as it stands the worker cannot
reply to the walsender while it is delaying a transaction. As a result, the
replicated position is never reported upstream and the walsender cannot exit.

Based on the above analysis, we can conclude that the worker must update the
flushpos and reply to the walsender while delaying the transaction if we want
to solve the issue. This cannot be done in the current approach, and a newer
proposed one[1] may be able to solve this, although it's currently under
discussion.
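
As a rough illustration of "reply while delaying", the control flow could look
like the standalone sketch below. This is not the patch and not the approach
proposed in [1]; all names (delay_with_feedback, FeedbackFn, FeedbackLog,
record_feedback) are hypothetical, and time is simulated instead of sleeping.
It also leaves open which position is safe to report during the delay: the
caller passes it in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t TimeMs;             /* simulated clock, in milliseconds */

/* Hypothetical callback that would send a status reply upstream. */
typedef void (*FeedbackFn)(uint64_t reported_lsn, void *arg);

/* Hypothetical recorder used here in place of a real network send. */
typedef struct FeedbackLog
{
    int      count;
    uint64_t last_lsn;
} FeedbackLog;

static inline void
record_feedback(uint64_t reported_lsn, void *arg)
{
    FeedbackLog *log = (FeedbackLog *) arg;

    log->count++;
    log->last_lsn = reported_lsn;
}

/*
 * Wait until apply_at, but wake up at least every 'interval' ms and send
 * feedback each time. Returns the number of feedback messages sent.
 */
static inline int
delay_with_feedback(TimeMs now, TimeMs apply_at, TimeMs interval,
                    uint64_t reported_lsn, FeedbackFn send_feedback, void *arg)
{
    int sent = 0;

    assert(interval > 0);
    assert(send_feedback != NULL);

    while (now < apply_at)
    {
        TimeMs remaining = apply_at - now;
        TimeMs slice = (remaining < interval) ? remaining : interval;

        /* A real worker would sleep here; the sketch just advances time. */
        now += slice;

        send_feedback(reported_lsn, arg);
        sent++;
    }
    return sent;
}
```

The point of the slicing is only that the sender keeps receiving replies
during the delay, so a shutdown check like the one in WalSndDone() has a
chance to succeed.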

Note that a similar issue can be reproduced with physical replication. When
wal_sender_timeout is set to 0 and the network between the primary and the
standby breaks after the primary has sent WAL to the standby, the primary node
cannot be stopped.

[1]:
https://www.postgresql.org/message-id/TYCPR01MB8373FA10EB2DB2BF8E458604ED1B9%40TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
