I've been poking into the src/test/subscription TAP tests, thinking
that they seem a lot slower than they ought to be.  The first thing
I came across was this bit in WaitForReplicationWorkerAttach():

        /*
         * We need timeout because we generally don't get notified via latch
         * about the worker attach.
         */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       1000L, WAIT_EVENT_BGWORKER_STARTUP);

Tracing showed that the worker is generally done attaching within two
or three milliseconds of the request, making the 1000ms delay in this
loop a rather serious killer of startup performance.  I think the right
way to fix that is to arrange to have a signal sent back from the worker;
but just to confirm that this *is* a bottleneck, I tried cutting the
timeout way back.  Setting it to 10ms indeed made the 001 and 002
subscription tests a good deal faster --- most of the time.  Every second
or third run or so, either the 001 or 002 test would take about 20sec
longer than it ought to.  When I set it to 5ms, the 001 test reliably took
ten times longer than normal, and the 002 test failed completely.

I've spent quite a bit of time tracking down why.  What seems
to be happening is that eliminating the excess delay in
WaitForReplicationWorkerAttach allows this if-test in
LogicalRepSyncTableStart

                /*
                 * There are now two possible states here:
                 * a) Sync is behind the apply.  If that's the case we need to
                 *      catch up with it by consuming the logical replication
                 *      stream up to the relstate_lsn.  For that, we exit this
                 *      function and continue in ApplyWorkerMain().
                 * b) Sync is caught up with the apply.  So it can just set
                 *      the state to SYNCDONE and finish.
                 */
                if (*origin_startpos >= MyLogicalRepWorker->relstate_lsn)

to sometimes fail, whereas with the delay in place it basically never
does, because the launcher never gets there fast enough to advance
relstate_lsn past what the synchronization worker had set before the
worker reaches this test and decides it can exit.  When the test fails,
we do what the comment calls (a): return to ApplyWorkerMain, start up
streaming replication, and enter LogicalRepApplyLoop.  But almost
immediately we reach the synchronization point and decide to shut down
again.  What happens then is that the partner walsender gets hung up
for tens of seconds, and until it sends back the "COPY 0"
CommandComplete message, the sync worker can't finish and exit.

What seems to be the reason for the hang is that WalSndWaitForWal() is
being called to ask for WAL that doesn't exist yet, and unlike the other
loops in walsender.c, it contains no provision for exiting early when the
client indicates it's lost interest.  So we just sit there until something
happens on the master server.  Although this situation might not last very
long on production servers, it still seems like a serious bug.

I was able to make the hang go away by means of the attached patch that
allows WalSndWaitForWal to exit early if the client has shut down the
COPY.  However, since that function is miserably underdocumented (like
most of this code :-(), I have little idea if this is correct or safe.

I also wonder why WalSndWaitForWal is being called for WAL that seemingly
doesn't exist yet, and whether that doesn't indicate another bug somewhere
in this stack.

I'm also kind of wondering why the "behind the apply" path out of
LogicalRepSyncTableStart exists at all; as far as I can tell we'd be much
better off if we just let the sync worker exit always as soon as it's done
the initial sync, letting any extra catchup happen later.  The main thing
the current behavior seems to be accomplishing is to monopolize one of the
scarce max_sync_workers_per_subscription slots for the benefit of a single
table, for longer than necessary.  Plus it requires additional
complicated interprocess signaling.

                        regards, tom lane

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f845180..2611d62 100644
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** WalSndWaitForWal(XLogRecPtr loc)
*** 1374,1379 ****
--- 1374,1387 ----
  		if (pq_flush_if_writable() != 0)
  			WalSndShutdown();
  
+ 		/*
+ 		 * If we have received CopyDone from the client, sent CopyDone
+ 		 * ourselves, and the output buffer is empty, it's time to exit
+ 		 * streaming, so fail the current WAL fetch request.
+ 		 */
+ 		if (!pq_is_send_pending() && streamingDoneSending && streamingDoneReceiving)
+ 			break;
+ 
  		now = GetCurrentTimestamp();
  
  		/* die if timeout was reached */
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)