Re: 001_rep_changes.pl stalls

Noah Misch Sat, 18 Apr 2020 00:02:13 -0700

On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi 
> <horikyota....@gmail.com> wrote in 
> > By the way, if latch is consumed in WalSndLoop, succeeding call to
> > WalSndWaitForWal cannot be woke-up by the latch-set.  Doesn't that
> > cause missing wakeups? (in other words, overlooking of wakeup latch).
> 
> - Since the only source other than timeout of walsender wakeup is latch,
> - we should avoid wasteful consuming of latch. (It is the same issue
> - with [1]).
> 
> + Since walsender is wokeup by LSN advancement via latch, we should
> + avoid wasteful consuming of latch. (It is the same issue with [1]).
> 
> 
> > If wakeup signal is not remembered on walsender (like
> > InterruptPending), WalSndPhysical cannot enter a sleep with
> > confidence.


No; per latch.h, "What must be avoided is placing any checks for asynchronous
events after WaitLatch and before ResetLatch, as that creates a race
condition."  In other words, the thing to avoid is calling ResetLatch()
without next examining all pending work that a latch would signal.  Each
walsender.c WaitLatch call does follow the rules.

On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
> On 2020/04/17 14:41, Noah Misch wrote:
> >1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> >    WalSndWaitForWal() when no WAL is available.  A quick version of this
> >    passes tests, but I'll need to audit WalSndWaitForWal() for things that 
> > are
> >    wrong for physical replication.
> 
> (1) makes even physical replication walsender sleep in two places and
> which seems to make the code for physical replication complicated
> more than necessary. I'd like to avoid (1) if possible.

Good point.

> >2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
> >    insufficient WAL is available.  This complicates the xlogreader.h API to
> >    pass back "wait for this XLogRecPtr", and we'd then persist enough state 
> > to
> >    resume decoding.  This doesn't have any advantages to make up for those.
> >
> >3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> >    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop().  This 
> > risks
> >    further drift between the two wait sites; on the other hand, one could
> >    refactor later to help avoid that.
> 
> Since the additional call of WalSndKeepalive() is necessary only for
> logical replication, it should be copied to, e.g., XLogSendLogical(),
> instead of WalSndLoop()? For example, when XLogSendLogical() sets
> WalSndCaughtUp to true, it should call WalSndKeepalive()?

We'd send a keepalive even when pq_flush_if_writable() can't empty the output
buffer.  That could be acceptable, but it's not ideal.

> The root problem seems that when WAL record that's no-opes for
> logical rep is processed, keep alive message has not sent immediately,
> in spite of that we want pg_stat_replication to be updated promptly.

The degree of promptness should be predictable, at least.  If we removed the
WalSndKeepalive() from WalSndWaitForWal(), pg_stat_replication updates would
not be prompt, but they would be predictable.  I do, however, think prompt
updates are worthwhile.

> (3) seems to try to address this problem straightly and looks better to me.
> 
> >4. Keep the WalSndLoop() wait, but condition it on !logical.  This is the
> >    minimal fix, but it crudely punches through the abstraction between
> >    WalSndLoop() and its WalSndSendDataCallback.
> 
> (4) also looks good because it's simple, if we can redesign those
> functions in good shape.

Let's do that.  I'm attaching the replacement implementation and the revert of
v1.

Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    Revert "When WalSndCaughtUp, sleep only in WalSndWaitForWal()."
    
    This reverts commit 421685812290406daea58b78dfab0346eb683bbb.  It caused
    idle physical walsenders to busy-wait, as reported by Fujii Masao.
    
    Discussion: https://postgr.es/m/20200417054146.ga1061...@rfd.leadboat.com

diff --git a/src/backend/replication/walsender.c 
b/src/backend/replication/walsender.c
index fc475d1..122d884 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1428,10 +1428,8 @@ WalSndWaitForWal(XLogRecPtr loc)
                /*
                 * We only send regular messages to the client for full decoded
                 * transactions, but a synchronous replication and walsender 
shutdown
-                * possibly are waiting for a later location. So, before 
sleeping, we
-                * send a ping containing the flush location. If the receiver is
-                * otherwise idle, this keepalive will trigger a reply. 
Processing the
-                * reply will update these MyWalSnd locations.
+                * possibly are waiting for a later location. So we send pings
+                * containing the flush location every now and then.
                 */
                if (MyWalSnd->flush < sentPtr &&
                        MyWalSnd->write < sentPtr &&
@@ -2316,16 +2314,20 @@ WalSndLoop(WalSndSendDataCallback send_data)
                WalSndKeepaliveIfNecessary();
 
                /*
-                * Block if we have unsent data.  Let WalSndWaitForWal() handle 
any
-                * other blocking; idle receivers need its additional actions.
+                * We don't block if not caught up, unless there is unsent data
+                * pending in which case we'd better block until the socket is
+                * write-ready.  This test is only needed for the case where the
+                * send_data callback handled a subset of the available data 
but then
+                * pq_flush_if_writable flushed it all --- we should 
immediately try
+                * to send more.
                 */
-               if (pq_is_send_pending())
+               if ((WalSndCaughtUp && !streamingDoneSending) || 
pq_is_send_pending())
                {
                        long            sleeptime;
                        int                     wakeEvents;
 
                        wakeEvents = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | 
WL_TIMEOUT |
-                               WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
+                               WL_SOCKET_READABLE;
 
                        /*
                         * Use fresh timestamp, not last_processing, to reduce 
the chance
@@ -2333,6 +2335,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
                         */
                        sleeptime = 
WalSndComputeSleeptime(GetCurrentTimestamp());
 
+                       if (pq_is_send_pending())
+                               wakeEvents |= WL_SOCKET_WRITEABLE;
+
                        /* Sleep until something happens or we time out */
                        (void) WaitLatchOrSocket(MyLatch, wakeEvents,
                                                                         
MyProcPort->sock, sleeptime,

Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    In caught-up logical walsender, sleep only in WalSndWaitForWal().
    
    Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
    < sentPtr.  When the latest physical LSN yields no logical replication
    messages (a common case), that keepalive elicits a reply.  Processing
    the reply updates pg_stat_replication.replay_lsn.  WalSndLoop() lacks
    that; when WalSndLoop() slept, replay_lsn advancement could stall until
    wal_receiver_status_interval elapsed.  This sometimes stalled
    src/test/subscription/t/001_rep_changes.pl for up to 10s.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/FIXME

diff --git a/src/backend/replication/walsender.c 
b/src/backend/replication/walsender.c
index 9e56115..a880195 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1428,8 +1428,10 @@ WalSndWaitForWal(XLogRecPtr loc)
                /*
                 * We only send regular messages to the client for full decoded
                 * transactions, but a synchronous replication and walsender 
shutdown
-                * possibly are waiting for a later location. So we send pings
-                * containing the flush location every now and then.
+                * possibly are waiting for a later location. So, before 
sleeping, we
+                * send a ping containing the flush location. If the receiver is
+                * otherwise idle, this keepalive will trigger a reply. 
Processing the
+                * reply will update these MyWalSnd locations.
                 */
                if (MyWalSnd->flush < sentPtr &&
                        MyWalSnd->write < sentPtr &&
@@ -2314,14 +2316,14 @@ WalSndLoop(WalSndSendDataCallback send_data)
                WalSndKeepaliveIfNecessary();
 
                /*
-                * We don't block if not caught up, unless there is unsent data
-                * pending in which case we'd better block until the socket is
-                * write-ready.  This test is only needed for the case where the
-                * send_data callback handled a subset of the available data 
but then
-                * pq_flush_if_writable flushed it all --- we should 
immediately try
-                * to send more.
+                * Block if we have unsent data.  XXX For logical replication, 
let
+                * WalSndWaitForWal(), handle any other blocking; idle 
receivers need
+                * its additional actions.  For physical replication, also 
block if
+                * caught up; its send_data does not block.
                 */
-               if ((WalSndCaughtUp && !streamingDoneSending) || 
pq_is_send_pending())
+               if ((WalSndCaughtUp && send_data != XLogSendLogical &&
+                        !streamingDoneSending) ||
+                       pq_is_send_pending())
                {
                        long            sleeptime;
                        int                     wakeEvents;

Re: 001_rep_changes.pl stalls

Reply via email to