On Mon, Jan 16, 2023 at 4:39 PM Hayato Kuroda (Fujitsu) <kuroda.hay...@fujitsu.com> wrote:
>
> > In logical replication apply precedes write and flush so we have no
> > indication whether a record is "replicated" to standby by other than
> > apply LSN. On the other hand, logical replication doesn't have a
> > business with switchover so that assurance is useless. Thus I think
> > we can (practically) ignore apply_lsn at shutdown. It seems subtly
> > irregular, though.
>
> Another consideration is that the condition (!pq_is_send_pending()) ensures that
> there are no pending messages, including other packets. Currently we force walsenders
> to clean up all messages before shutting down, even if it is a keepalive one.
> I cannot have any problems caused by this, but I can keep the condition in case of
> logical replication.
>
Let me try to summarize the discussion till now. The problem we are trying to solve here is to allow a shutdown to complete when walsender is not able to send the entire WAL. Currently, in such cases, the shutdown fails. As per our current understanding, this can happen when (a) the walreceiver/walapply process is stuck (not able to receive more WAL) due to locks or some other reason; (b) a long time delay has been configured to apply the WAL (we don't yet have such a feature for logical replication but the discussion for the same is in progress).

Both reasons mostly apply to logical replication because there is no separate walreceiver process whose job is to just flush the WAL. In logical replication, the process that receives the WAL also applies it. So, while applying, it can get stuck for a long time waiting for some heavy-weight lock to be released by some other long-running transaction in a backend. Similarly, if the user has configured a large value for time-delayed apply, the network buffer between the walsender and the receiver process can fill up.

The condition that makes the shutdown wait until all WAL has been sent has two parts: (a) it confirms that there is no pending WAL to be sent; (b) it confirms that all the WAL sent has been flushed by the client. As per our understanding, both these conditions exist to allow a clean switchover/failover, which seems to be useful only for physical replication. Logical replication doesn't provide such functionality.

The proposed patch tries to eliminate condition (b) for logical replication in the hope that this will allow the shutdown to complete in most cases. No specific reason has been discussed for not also dropping (a) for logical replication.

Now, to proceed here we have the following options:
(1) Fix (b) as proposed by the patch and document the risks related to (a);
(2) Fix both (a) and (b);
(3) Do nothing and document that users need to unblock the subscribers to complete the shutdown.

Thoughts?

--
With Regards,
Amit Kapila.
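
To make the three options easier to compare, here is a rough sketch of the exit decision as a small standalone model. It is not the walsender code: the struct, the enum, and the function names are all hypothetical, and LSNs are reduced to plain integers. Condition (a) is modeled as "everything up to the end of WAL has been sent and nothing is left in the output buffer", and condition (b) as "the client has reported flushing everything that was sent".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a walsender at shutdown time. */
typedef struct
{
    bool     is_logical;      /* logical (true) or physical (false) replication */
    uint64_t end_of_wal_lsn;  /* last WAL position that exists on the server */
    uint64_t sent_lsn;        /* last WAL position handed to the client */
    uint64_t flushed_lsn;     /* last WAL position the client reported as flushed */
    bool     send_pending;    /* output buffer still holds unsent bytes */
} SenderState;

/* POLICY_CURRENT is today's behaviour, which option (3) would keep. */
typedef enum { POLICY_CURRENT, POLICY_OPTION_1, POLICY_OPTION_2 } ShutdownPolicy;

/* Condition (a): no WAL or other message is still waiting to be sent. */
static bool
all_wal_sent(const SenderState *s)
{
    return s->sent_lsn == s->end_of_wal_lsn && !s->send_pending;
}

/* Condition (b): the client has flushed everything that was sent. */
static bool
all_wal_flushed(const SenderState *s)
{
    return s->flushed_lsn == s->sent_lsn;
}

/* Returns true if the sender may exit and let the shutdown proceed. */
bool
sender_can_exit(const SenderState *s, ShutdownPolicy policy)
{
    /* Physical replication keeps both conditions under every option. */
    if (!s->is_logical || policy == POLICY_CURRENT)
        return all_wal_sent(s) && all_wal_flushed(s);

    /* Option (1): drop (b) for logical replication, keep (a). */
    if (policy == POLICY_OPTION_1)
        return all_wal_sent(s);

    /* Option (2): drop both (a) and (b) for logical replication. */
    return true;
}
```

In this model, a subscriber stuck on a lock after receiving everything is released by option (1), while a subscriber that has stopped reading (so the send buffer stays full) is only released by option (2).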