On Fri, May 13, 2022 at 6:41 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy
> <bharath.rupireddyforpostg...@gmail.com> wrote:
> > Here's the rebased v9 patch.
>
> This seems like it has enormous overlap with the existing
> functionality that we have from log_startup_progress_interval.
>
> I think that facility is also better-designed than this one. It prints
> out a message based on elapsed time, whereas this patch prints out a
> message based progress through the WAL. That means that if WAL replay
> isn't actually advancing for some reason, you just won't get any log
> messages and you don't know whether it's advancing slowly or not at
> all or the server is just hung. With that facility you can distinguish
> those cases.
>
> Also, if for some reason we do think that amount of WAL replayed is
> the right metric, rather than time, why would we only allow high=1
> segment and low=128 segments, rather than say any number of MB or GB
> that the user would like to configure?
>
> I suggest that if log_startup_progress_interval doesn't meet your
> needs here, we should try to understand why not and maybe enhance it,
> instead of adding a separate facility.
After thinking about this for a while, I agree with Robert and others that we could leverage the existing log_startup_progress_interval mechanism to report which WAL file is currently being replayed. I added the current TLI (which helps construct the WAL file name from the current LSN) and the current WAL file source (which shows where the WAL file was fetched from) to the existing "redo in progress, elapsed time:..." message. This serves the purpose well: it helps identify issues such as the restore command taking a long time (longer than log_startup_progress_interval, for instance), and it shows the WAL replay rate on a standby or primary during long recoveries.

However, ereport_startup_progress isn't enabled on standbys, so that it doesn't bloat the server logs. I believe the "redo in progress, elapsed time:..." message can provide important information for standbys too, and there's no way for users to enable it on standbys today. For instance, by comparing two or more such messages, users can tell how well the standby is faring at replay. If it were enabled with the default log_startup_progress_interval of 10 sec, a standby could emit 8640 messages per day, which is too much. I'm not sure whether we are okay with changing the default to, say, 1min or 5min, so that 1440 or 288 messages per day are emitted, respectively. In production environments, users typically may not care if recovery takes just 10 sec, but they really are interested if it takes on the order of minutes.

Basically, I would like to enable the "redo in progress, elapsed time:..." message for standbys too. Thoughts?

PSA v10 patch with the enhanced "redo in progress, elapsed time:..." message. Note that it's not a final patch though.

Regards,
Bharath Rupireddy.
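
P.S. For illustration only, here is a minimal sketch (not part of the patch) of the two calculations mentioned above: deriving a WAL file name from a TLI and an LSN, and counting how many progress messages a given interval produces per day. The function names are hypothetical, and the sketch assumes the default 16MB wal_segment_size. It maps the LSN to the segment containing that byte position and does not try to mimic any segment-boundary conventions used by server-side functions.

```python
# Illustrative sketch; function names are hypothetical, not PostgreSQL APIs.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default wal_segment_size (16MB)
SECONDS_PER_DAY = 24 * 60 * 60


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'XXXXXXXX/YYYYYYYY' (hex) to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(tli: int, lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Build the 24-hex-digit WAL file name for the segment containing lsn."""
    segno = parse_lsn(lsn) // seg_size
    # Number of segments that fit in one 4GB "xlogid" range.
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid, segno % segs_per_xlogid)


def messages_per_day(interval_seconds: int) -> int:
    """Number of progress messages emitted per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


if __name__ == "__main__":
    print(wal_file_name(1, "0/3000060"))  # 000000010000000000000003
    for interval in (10, 60, 300):
        print(interval, messages_per_day(interval))
```

With the TLI in the log message, a calculation like wal_file_name() is all that is needed to go from the reported LSN to the file being replayed.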
v10-0001-Add-WAL-recovery-info-to-startup-progress-log-me.patch
Description: Binary data