Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
important for promptly updating pg_stat_replication.  When caught up, we
should impose that logic before every sleep.  The one-line fix is to sleep in
WalSndLoop() only when pq_is_send_pending(), not when caught up.


On my regular development machine, src/test/subscription/t/001_rep_changes.pl
stalls for ~10s at this wait_for_catchup:

  $node_publisher->safe_psql('postgres', "DELETE FROM tab_rep");

  # Restart the publisher and check the state of the subscriber which
  # should be in a streaming state after catching up.
  $node_publisher->stop('fast');
  $node_publisher->start;

  $node_publisher->wait_for_catchup('tap_sub');

That snippet emits three notable physical WAL records.  There's a
Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and
an XLOG/FPI_FOR_HINT.

The buildfarm has stalled there, but it happens probably less than half the
time.  Examples[1] showing the stall:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check

Here's the most-relevant walsender call tree:

WalSndLoop
  XLogSendLogical (caller invokes once per loop iteration, via send_data 
callback)
    XLogReadRecord (caller invokes once)
      ReadPageInternal (caller invokes twice in this test; more calls are 
possible)
        logical_read_xlog_page (caller skips when page is same as last call, 
else invokes 1-2 times via state->read_page() callback, registered in 
StartLogicalReplication)
          WalSndWaitForWal (caller invokes once; has fast path)


The cause is a race involving the flow of reply messages (send_feedback()
messages) from logical apply worker to walsender.  Here are two sequencing
patterns; the more-indented parts are what differ.  Stalling pattern:

  sender reads Transaction/COMMIT and sends the changes
  receiver applies the changes
  receiver send_feedback() reports progress up to Transaction/COMMIT
    sender accepts the report
    sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are 
no-ops for logical rep
    sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver wal_receiver_status_interval elapses; receiver reports progress up 
to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to 
"MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to 
XLOG/FPI_FOR_HINT

Non-stalling pattern (more prevalent with lower machine performance):

  sender reads Transaction/COMMIT and sends the changes
    sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are 
no-ops for logical rep
    sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
  receiver applies the changes
  receiver send_feedback() reports progress up to Transaction/COMMIT
  sender wakes up, accepts the report
  sender calls WalSndWaitForWal(), which sends a keepalive due to 
"MyWalSnd->write < sentPtr"
  receiver gets keepalive, send_feedback() reports progress up to 
XLOG/FPI_FOR_HINT


The fix is to test "MyWalSnd->write < sentPtr" before more sleeps.  The test
is unnecessary when sleeping due to pq_is_send_pending(); in that case, the
receiver is not idle and will reply before idling.  I changed WalSndLoop() to
sleep only for pq_is_send_pending().  For all other sleep reasons, the sleep
will happen in WalSndWaitForWal().  Attached.  I don't know whether this is
important outside of testing scenarios.  I lean against back-patching, but I
will back-patch if someone thinks this qualifies as a performance bug.

Thanks,
nm

[1] I spot-checked only my animals, since I wanted to experiment on an
affected animal.
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    When WalSndCaughtUp, sleep only in WalSndWaitForWal().
    
    Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
    < sentPtr.  That is important in logical replication.  When the latest
    physical LSN yields no logical replication messages (a common case),
    that keepalive elicits a reply, and processing the reply updates
    pg_stat_replication.replay_lsn.  WalSndLoop() lacks that; when
    WalSndLoop() slept, replay_lsn advancement could stall until
    wal_receiver_status_interval elapsed.  This sometimes stalled
    src/test/subscription/t/001_rep_changes.pl for up to 10s.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/FIXME

diff --git a/src/backend/replication/walsender.c 
b/src/backend/replication/walsender.c
index 9e56115..d9c6359 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1428,8 +1428,10 @@ WalSndWaitForWal(XLogRecPtr loc)
                /*
                 * We only send regular messages to the client for full decoded
                 * transactions, but a synchronous replication and walsender 
shutdown
-                * possibly are waiting for a later location. So we send pings
-                * containing the flush location every now and then.
+                * possibly are waiting for a later location. So, before 
sleeping, we
+                * send a ping containing the flush location. If the receiver is
+                * otherwise idle, this keepalive will trigger a reply. 
Processing the
+                * reply will update these MyWalSnd locations.
                 */
                if (MyWalSnd->flush < sentPtr &&
                        MyWalSnd->write < sentPtr &&
@@ -2314,20 +2316,16 @@ WalSndLoop(WalSndSendDataCallback send_data)
                WalSndKeepaliveIfNecessary();
 
                /*
-                * We don't block if not caught up, unless there is unsent data
-                * pending in which case we'd better block until the socket is
-                * write-ready.  This test is only needed for the case where the
-                * send_data callback handled a subset of the available data 
but then
-                * pq_flush_if_writable flushed it all --- we should 
immediately try
-                * to send more.
+                * Block if we have unsent data.  Let WalSndWaitForWal() handle 
any
+                * other blocking; idle receivers need its additional actions.
                 */
-               if ((WalSndCaughtUp && !streamingDoneSending) || 
pq_is_send_pending())
+               if (pq_is_send_pending())
                {
                        long            sleeptime;
                        int                     wakeEvents;
 
                        wakeEvents = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | 
WL_TIMEOUT |
-                               WL_SOCKET_READABLE;
+                               WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
 
                        /*
                         * Use fresh timestamp, not last_processing, to reduce 
the chance
@@ -2335,9 +2333,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
                         */
                        sleeptime = 
WalSndComputeSleeptime(GetCurrentTimestamp());
 
-                       if (pq_is_send_pending())
-                               wakeEvents |= WL_SOCKET_WRITEABLE;
-
                        /* Sleep until something happens or we time out */
                        (void) WaitLatchOrSocket(MyLatch, wakeEvents,
                                                                         
MyProcPort->sock, sleeptime,

Reply via email to