Re: Unnecessary delay in streaming replication due to replay lag

Michael Paquier Mon, 08 Nov 2021 01:41:34 -0800

On Tue, Aug 24, 2021 at 09:51:25PM -0700, Soumyadeep Chakraborty wrote:
> Ashwin and I recently got a chance to work on this and we addressed all
> outstanding feedback and suggestions. PFA a significantly reworked patch.


+static void
+StartWALReceiverEagerly()
+{
The patch fails to apply because of the recent changes from Robert to
eliminate ThisTimeLineID.  The correct thing to do would be to add one
TimeLineID argument, passing down the local ThisTimeLineID in
StartupXLOG() and using XLogCtl->lastReplayedTLI in
CheckRecoveryConsistency().
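
A minimal sketch of that signature change, using stand-in types rather than the real PostgreSQL headers.  XLogCtlStub, requested_tli and the two *_sketch callers are hypothetical names for illustration; only the idea of passing the timeline explicitly comes from the comment above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical stand-in for the shared-memory state behind XLogCtl. */
typedef struct
{
    TimeLineID  lastReplayedTLI;
} XLogCtlStub;

static XLogCtlStub xlogctl_stub = {0};

/* Records the timeline the sketch would pass to RequestXLogStreaming(). */
static TimeLineID requested_tli = 0;

/* The timeline is an explicit argument instead of a global variable. */
static void
StartWALReceiverEagerly(TimeLineID tli)
{
    assert(tli != 0);           /* timelines start at 1 */
    requested_tli = tli;
}

/* Caller 1: StartupXLOG() would pass down its local timeline. */
static void
startup_xlog_sketch(TimeLineID local_tli)
{
    StartWALReceiverEagerly(local_tli);
}

/* Caller 2: CheckRecoveryConsistency() would read the shared state. */
static void
check_recovery_consistency_sketch(void)
{
    StartWALReceiverEagerly(xlogctl_stub.lastReplayedTLI);
}
```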

+       /*
+        * We should never reach here. We should have at least one valid WAL
+        * segment in our pg_wal, for the standby to have started.
+        */
+       Assert(false);
The reason behind that is not that we have a standby, but that we have
read at least the segment that includes the checkpoint record we are
replaying from (it is possible for a standby to start without any
contents in pg_wal/ as long as recovery is configured), and that
StartWALReceiverEagerly() is called just after that.

It would be better to make sure that StartWALReceiverEagerly() only
gets called from the startup process, perhaps?

+       RequestXLogStreaming(ThisTimeLineID, startptr, PrimaryConnInfo,
+                            PrimarySlotName, wal_receiver_create_temp_slot);
+       XLogReaderFree(state);
XLogReaderFree() should happen before RequestXLogStreaming().  The
tipping point of the patch is here, where the WAL receiver is started
based on the location of the first valid WAL record found.

wal_receiver_start_condition is missing in postgresql.conf.sample.

+       /*
+        * Start WAL receiver eagerly if requested.
+        */
+       if (StandbyModeRequested && !WalRcvStreaming() &&
+               PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+               wal_receiver_start_condition == WAL_RCV_START_AT_STARTUP)
+               StartWALReceiverEagerly();
[...]
+       if (StandbyModeRequested && !WalRcvStreaming() && reachedConsistency &&
+               PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0 &&
+               wal_receiver_start_condition == WAL_RCV_START_AT_CONSISTENCY)
+               StartWALReceiverEagerly();
This repeats the same set of conditions twice, which does not look
like a good idea to me.  I think that you'd better add an extra
argument to StartWALReceiverEagerly() to track the start timing expected
in this code path, to be matched against the GUC inside the routine.
It would be better to document the reasons behind each check done, as
well.
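
As a rough sketch of that suggestion, the routine below takes the timing of its caller and performs every check in one place, each with its reason.  The globals are local stand-ins for the real ones, walrcv_streaming replaces WalRcvStreaming(), and WAL_RCV_START_AT_EXHAUST and the boolean return value are assumptions made for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum
{
    WAL_RCV_START_AT_STARTUP,
    WAL_RCV_START_AT_CONSISTENCY,
    WAL_RCV_START_AT_EXHAUST    /* assumed name for the 'exhaust' setting */
} WalRcvStartCondition;

/* Stand-ins for the real globals and GUCs. */
static bool StandbyModeRequested = false;
static bool reachedConsistency = false;
static const char *PrimaryConnInfo = NULL;
static WalRcvStartCondition wal_receiver_start_condition = WAL_RCV_START_AT_EXHAUST;
static bool walrcv_streaming = false;   /* stand-in for WalRcvStreaming() */

/* Returns true if this call started the WAL receiver. */
static bool
StartWALReceiverEagerly(WalRcvStartCondition caller_timing)
{
    /* 'exhaust' is the default behavior, never an eager call site. */
    assert(caller_timing != WAL_RCV_START_AT_EXHAUST);

    /* Only act in the code path that matches the configured timing. */
    if (wal_receiver_start_condition != caller_timing)
        return false;
    /* An eager start only makes sense for a standby. */
    if (!StandbyModeRequested)
        return false;
    /* Without connection information there is nothing to stream from. */
    if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)
        return false;
    /* Never start a second WAL receiver. */
    if (walrcv_streaming)
        return false;
    /* The consistency path also requires a consistent state. */
    if (caller_timing == WAL_RCV_START_AT_CONSISTENCY && !reachedConsistency)
        return false;

    walrcv_streaming = true;    /* the real code would request streaming here */
    return true;
}
```

Both call sites then shrink to a single call that passes their own timing.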

+       /* Find the latest and earliest WAL segments in pg_wal */
+       dir = AllocateDir("pg_wal");
+       while ((de = ReadDir(dir, "pg_wal")) != NULL)
+       {
[ ... ]
+       /* Find the latest valid WAL segment and request streaming from its start */
+       while (endsegno >= startsegno)
+       {
[...]
+               XLogReaderFree(state);
+               endsegno--;
+       }
So, this reads the contents of pg_wal/ for any files that exist, then
goes down to the first segment found with a valid beginning.  That's
going to be expensive with a large max_wal_size.  When searching for a
point like that, a binary search (dichotomy) would be a better way to
calculate the LSN you'd like to start from.  Anyway, I think that there is a problem
with the approach: what should we do if there are holes in the
segments present in pg_wal/?  As of HEAD, or
wal_receiver_start_condition = 'exhaust' in this patch, we would
switch across local pg_wal/, archive and stream in a linear way,
thanks to WaitForWALToBecomeAvailable().  For example, imagine that we
have a standby with the following set of valid segments, because of
the buggy way a base backup has been taken:
000000010000000000000001
000000010000000000000003
000000010000000000000005
What the patch would do is start a WAL receiver from segment 5,
which contradicts the existing logic, where we should try
to look for the segment once we are waiting for something in segment
2.  This would be dangerous once the startup process waits for some
WAL to become available, because we have a WAL receiver started, but
we cannot fetch the segment we need.  Perhaps a deployment has
archiving, in which case it would be able to grab segment 2 (without
archiving, recovery would not be able to move on, so that would be
game over).
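
To illustrate both points, here is a hedged sketch, not code from the patch: a binary search over sorted segment numbers finds where the gap-free run ends, so the answer for the example above is segment 2 rather than 5.  It assumes the segment numbers were already collected and sorted, and that the run of interest starts at the first entry; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogSegNo;

/*
 * Given segment numbers sorted in ascending order without duplicates,
 * return the index of the last segment in the gap-free run that starts
 * at segs[0].  Entry i belongs to that run iff segs[i] - segs[0] == i,
 * and once that stops holding it never holds again, so a binary search
 * applies.
 */
static size_t
last_contiguous_index(const XLogSegNo *segs, size_t nsegs)
{
    size_t      lo = 0;
    size_t      hi;

    assert(nsegs > 0);
    hi = nsegs - 1;

    while (lo < hi)
    {
        /* Round up so that "lo = mid" always makes progress. */
        size_t      mid = lo + (hi - lo + 1) / 2;

        if (segs[mid] - segs[0] == (XLogSegNo) mid)
            lo = mid;           /* still gap-free up to mid */
        else
            hi = mid - 1;       /* a hole exists at or before mid */
    }
    return lo;
}

/* First segment that has to come from the archive or from streaming. */
static XLogSegNo
first_missing_segment(const XLogSegNo *segs, size_t nsegs)
{
    return segs[last_contiguous_index(segs, nsegs)] + 1;
}
```

A start point computed this way would stay in line with the linear lookup done by WaitForWALToBecomeAvailable(), at the cost of ignoring later segments that recovery cannot reach anyway.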
 
         /*
          * Move to XLOG_FROM_STREAM state, and set to start a
-         * walreceiver if necessary.
+         * walreceiver if necessary. The WAL receiver may have
+         * already started (if it was configured to start
+         * eagerly).
          */
         currentSource = XLOG_FROM_STREAM;
-        startWalReceiver = true;
+        startWalReceiver = !WalRcvStreaming();
         break;
     case XLOG_FROM_ARCHIVE:
     case XLOG_FROM_PG_WAL:
 
-        /*
-         * WAL receiver must not be running when reading WAL from
-         * archive or pg_wal.
-         */
-        Assert(!WalRcvStreaming());

These parts should IMO not be changed.  They are strong assumptions we
rely on in the startup process, and this comes down to the fact that
it is not a good idea to have a WAL receiver running while
currentSource could be pointing at a completely different WAL source.
That's going to bring a lot of race conditions, I am afraid, as we
rely on currentSource a lot during recovery, combined with the fact that we
expect the code to be able to retrieve WAL in a linear fashion from
the LSN position that recovery is looking for.
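
A deliberately simplified model of that assumption, for illustration only: the WAL receiver is started solely on the switch to XLOG_FROM_STREAM and stopped before falling back, so the removed assertion keeps holding.  The real state machine has more states and more failure cases; walrcv_streaming, switch_source_on_failure() and source_invariant_holds() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    XLOG_FROM_ARCHIVE,
    XLOG_FROM_PG_WAL,
    XLOG_FROM_STREAM
} XLogSource;

static XLogSource currentSource = XLOG_FROM_ARCHIVE;
static bool walrcv_streaming = false;   /* stand-in for WalRcvStreaming() */

/* Called when the current source has no more WAL to offer. */
static void
switch_source_on_failure(void)
{
    switch (currentSource)
    {
        case XLOG_FROM_ARCHIVE:
        case XLOG_FROM_PG_WAL:
            /* No WAL receiver may run while reading archive or pg_wal. */
            assert(!walrcv_streaming);
            currentSource = XLOG_FROM_STREAM;
            walrcv_streaming = true;    /* models startWalReceiver = true */
            break;
        case XLOG_FROM_STREAM:
            /* Streaming failed: stop the receiver before falling back. */
            walrcv_streaming = false;
            currentSource = XLOG_FROM_ARCHIVE;
            break;
    }
}

/* In this model the receiver runs exactly when the source is the stream. */
static bool
source_invariant_holds(void)
{
    return walrcv_streaming == (currentSource == XLOG_FROM_STREAM);
}
```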

So, I think that deciding if a WAL receiver should be started blindly,
outside of the code path deciding if the startup process is waiting
for some WAL, is not a good idea, and the position we begin to
stream from may be something that we have zero need for in the
end (this is going to be tricky if we detect a TLI jump while
replaying the local WAL, also?).  The issue is that I am not sure what
a good design for that should be.  We have no idea when the startup
process will need WAL from a different source until replay comes
around, but what we want here is to anticipate this LSN :)

I am wondering if there could be a way to work something out with the
control file, but things can get very fancy with HA
and base backup deployments and the various cases we support thanks to
the current way recovery works.  We could also go simpler and
rework the priority order if both archiving and streaming are options
wanted by the user.
--
Michael
