Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is important for promptly updating pg_stat_replication. When caught up, we should impose that logic before every sleep. The one-line fix is to sleep in WalSndLoop() only when pq_is_send_pending(), not when caught up.
On my regular development machine, src/test/subscription/t/001_rep_changes.pl stalls for ~10s at this wait_for_catchup: $node_publisher->safe_psql('postgres', "DELETE FROM tab_rep"); # Restart the publisher and check the state of the subscriber which # should be in a streaming state after catching up. $node_publisher->stop('fast'); $node_publisher->start; $node_publisher->wait_for_catchup('tap_sub'); That snippet emits three notable physical WAL records. There's a Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and an XLOG/FPI_FOR_HINT. The buildfarm has stalled there, but it happens probably less than half the time. Examples[1] showing the stall: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check Here's the most-relevant walsender call tree: WalSndLoop XLogSendLogical (caller invokes once per loop iteration, via send_data callback) XLogReadRecord (caller invokes once) ReadPageInternal (caller invokes twice in this test; more calls are possible) logical_read_xlog_page (caller skips when page is same as last call, else invokes 1-2 times via state->read_page() callback, registered in StartLogicalReplication) WalSndWaitForWal (caller invokes once; has fast path) The cause is a race involving the flow of reply messages (send_feedback() messages) from logical apply worker to walsender. Here are two sequencing patterns; the more-indented parts are what differ. Stalling pattern: sender reads Transaction/COMMIT and sends the changes receiver applies the changes receiver send_feedback() reports progress up to Transaction/COMMIT sender accepts the report sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop() receiver wal_receiver_status_interval elapses; receiver reports progress up to Transaction/COMMIT sender wakes up, accepts the report sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr" receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT Non-stalling pattern (more prevalent with lower machine performance): sender reads Transaction/COMMIT and sends the changes sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop() receiver applies the changes receiver send_feedback() reports progress up to Transaction/COMMIT sender wakes up, accepts the report sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr" receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT The fix is to test "MyWalSnd->write < sentPtr" before more sleeps. The test is unnecessary when sleeping due to pq_is_send_pending(); in that case, the receiver is not idle and will reply before idling. I changed WalSndLoop() to sleep only for pq_is_send_pending(). For all other sleep reasons, the sleep will happen in WalSndWaitForWal(). Attached. I don't know whether this is important outside of testing scenarios. I lean against back-patching, but I will back-patch if someone thinks this qualifies as a performance bug. Thanks, nm [1] I spot-checked only my animals, since I wanted to experiment on an affected animal.
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> When WalSndCaughtUp, sleep only in WalSndWaitForWal(). Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write < sentPtr. That is important in logical replication. When the latest physical LSN yields no logical replication messages (a common case), that keepalive elicits a reply, and processing the reply updates pg_stat_replication.replay_lsn. WalSndLoop() lacks that; when WalSndLoop() slept, replay_lsn advancement could stall until wal_receiver_status_interval elapsed. This sometimes stalled src/test/subscription/t/001_rep_changes.pl for up to 10s. Reviewed by FIXME. Discussion: https://postgr.es/m/FIXME diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 9e56115..d9c6359 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1428,8 +1428,10 @@ WalSndWaitForWal(XLogRecPtr loc) /* * We only send regular messages to the client for full decoded * transactions, but a synchronous replication and walsender shutdown - * possibly are waiting for a later location. So we send pings - * containing the flush location every now and then. + * possibly are waiting for a later location. So, before sleeping, we + * send a ping containing the flush location. If the receiver is + * otherwise idle, this keepalive will trigger a reply. Processing the + * reply will update these MyWalSnd locations. */ if (MyWalSnd->flush < sentPtr && MyWalSnd->write < sentPtr && @@ -2314,20 +2316,16 @@ WalSndLoop(WalSndSendDataCallback send_data) WalSndKeepaliveIfNecessary(); /* - * We don't block if not caught up, unless there is unsent data - * pending in which case we'd better block until the socket is - * write-ready. This test is only needed for the case where the - * send_data callback handled a subset of the available data but then - * pq_flush_if_writable flushed it all --- we should immediately try - * to send more. + * Block if we have unsent data. Let WalSndWaitForWal() handle any + * other blocking; idle receivers need its additional actions. */ - if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending()) + if (pq_is_send_pending()) { long sleeptime; int wakeEvents; wakeEvents = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT | - WL_SOCKET_READABLE; + WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE; /* * Use fresh timestamp, not last_processing, to reduce the chance @@ -2335,9 +2333,6 @@ WalSndLoop(WalSndSendDataCallback send_data) */ sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); - if (pq_is_send_pending()) - wakeEvents |= WL_SOCKET_WRITEABLE; - /* Sleep until something happens or we time out */ (void) WaitLatchOrSocket(MyLatch, wakeEvents, MyProcPort->sock, sleeptime,