On Tue, Nov 23, 2021 at 1:39 AM Soumyadeep Chakraborty <soumyadeep2...@gmail.com> wrote: > > Hi Bharath, > > Yes, that thread has been discussed here. Asim had x-posted the patch to [1]. > This thread > was more recent when Ashwin and I picked up the patch in Aug 2021, so we > continued here. > The patch has been significantly updated by us, addressing Michael's long > outstanding feedback.
Thanks for the patch. I reviewed it a bit, here are some comments: 1) A memory leak: add FreeDir(dir); before returning. + ereport(LOG, + (errmsg("Could not start streaming WAL eagerly"), + errdetail("There are timeline changes in the locally available WAL files."), + errhint("WAL streaming will begin once all local WAL and archives are exhausted."))); + return; + } 2) Is there a guarantee that while we traverse the pg_wal directory to find startsegno and endsegno, the new wal files arrive from the primary or archive location or old wal files get removed/recycled by the standby? Especially when wal_receiver_start_condition=consistency? + startsegno = (startsegno == -1) ? logSegNo : Min(startsegno, logSegNo); + endsegno = (endsegno == -1) ? logSegNo : Max(endsegno, logSegNo); + } 3) I think the errmsg text format isn't correct. Note that the errmsg text starts with lowercase and doesn't end with "." whereas errdetail or errhint starts with uppercase and ends with ".". Please check other messages for reference. The following should be changed. + errmsg("Requesting stream from beginning of: %s", + errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.", + (errmsg("Could not start streaming WAL eagerly"), 4) I think you also need to have wal files names in double quotes, something like below: errmsg("could not close file \"%s\": %m", xlogfname))); 5) It is ".....stream start: \"%s\", skipping..", + errmsg("Invalid WAL segment found while calculating stream start: %s. Skipping.", 4) I think the patch can make the startup process significantly slow, especially when there are lots of wal files that exist in the standby pg_wal directory. This is because of the overhead StartWALReceiverEagerlyIfPossible adds i.e. doing two while loops to figure out the start position of the streaming in advance. This might end up the startup process doing the loop over in the directory rather than the important thing of doing crash recovery or standby recovery. 5) What happens if this new GUC is enabled in case of a synchronous standby? What happens if this new GUC is enabled in case of a crash recovery? What happens if this new GUC is enabled in case a restore command is set i.e. standby performing archive recovery? 6) How about bgwriter/checkpointer which gets started even before the startup process (or a new bg worker? of course it's going to be an overkill) finding out the new start pos for the startup process and then we could get rid of <literal>startup</literal> behaviour of the patch? This avoids an extra burden on the startup process. Many times, users will be complaining about why recovery is taking more time now, after the GUC wal_receiver_start_condition=startup. 7) I think we can just have 'consistency' and 'exhaust' behaviours and let the bgwrite or checkpointer find out the start position for the startup process, so the startup process whenever reaches a consistent point, it sees if the other process has calculated start pos for it or not, if yes it starts wal receiver other wise it goes with its usual recovery. I'm not sure if this will be a good idea. 8) Can we have a better GUC name than wal_receiver_start_condition? Something like wal_receiver_start_at or wal_receiver_start or some other? Regards, Bharath Rupireddy.