On Mon, 25 Mar 2024 at 21:36, Hayato Kuroda (Fujitsu) <kuroda.hay...@fujitsu.com> wrote:
>
> Dear Bharath, Peter,
>
> > > Looks like BF animals aren't happy, please check -
> > > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> >
> > Looks like sanitizer failures. There were a few messages about that
> > recently, but those were all just about freeing memory after use, which
> > we don't necessarily require for client programs. So maybe something else.
>
> It seems that there are several types of failures, [1] and [2].
>
> ## Analysis for failure 1
>
> The failure is caused by a time lag between the walreceiver finishing and
> pg_is_in_recovery() returning false.
>
> According to the output [1], it seems that the tool failed at
> wait_for_end_recovery() with the message "standby server disconnected from
> the primary". Also, the lines "redo done at..." and "terminating walreceiver
> process due to administrator command" mean that the walreceiver was
> requested to shut down by XLogShutdownWalRcv().
>
> According to the source, we can confirm that the walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also,
> SharedRecoveryState is changed to RECOVERY_STATE_DONE (which means that
> pg_is_in_recovery() returns false) in the latter part of StartupXLOG().
>
> So, if there is a delay between FinishWalRecovery() and the state change,
> the check in wait_for_end_recovery() fails during that window. Since we
> tolerate a missing walreceiver 10 times and it is checked once per second,
> the failure occurs if the time lag is longer than 10 seconds.
>
> I do not have a good way to fix it. One approach is to make
> NUM_CONN_ATTEMPTS larger, but it's not a fundamental solution.
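To make the timing in the analysis above concrete, here is a minimal standalone sketch of the polling logic. It is a simplified model under stated assumptions, not the pg_createsubscriber source: simulate_wait() and its lag_seconds parameter are hypothetical names, the walreceiver is assumed to have exited at t = 0, the server is assumed to stay "in recovery" until lag_seconds have passed, and one poll is assumed to happen per second.

```c
#include <stdbool.h>

/* The thread mentions 10 tolerated misses, checked once per second. */
#define NUM_CONN_ATTEMPTS 10

/*
 * Hypothetical model of the wait loop (not the real implementation).
 *
 * t is the number of seconds since the walreceiver exited.  The server is
 * assumed to report "in recovery" while t < lag_seconds, i.e. until
 * SharedRecoveryState would have been set to RECOVERY_STATE_DONE.
 *
 * Returns true if the end of recovery is observed before the loop gives
 * up, false if it bails out with "standby server disconnected".
 */
bool
simulate_wait(int lag_seconds)
{
    int misses = 0;

    for (int t = 0;; t++)
    {
        bool in_recovery = (t < lag_seconds);

        /* Recovery finished: the wait succeeds. */
        if (!in_recovery)
            return true;

        /*
         * Still in recovery, but the walreceiver is already gone, so
         * every poll in this window counts as a consecutive miss.
         */
        if (++misses > NUM_CONN_ATTEMPTS)
            return false;
    }
}
```

Under this model simulate_wait(10) still succeeds while simulate_wait(11) bails out, and raising NUM_CONN_ATTEMPTS only moves that boundary. The exact cutoff in the real loop depends on how it compares the counter, which this sketch only assumes.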
I agree with your analysis. Another way to fix this could be to remove the
following check, since increasing the count would still leave the race
condition in place:

    /*
     * If it is still in recovery, make sure the target server is
     * connected to the primary so it can receive the required WAL to
     * finish the recovery process. If it is disconnected try
     * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
     */
    res = PQexec(conn,
                 "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");

I'm not sure whether we should worry about the case where recovery is not
done and the walreceiver has exited, as we have the following sanity check in
check_subscriber before we wait for recovery to finish:

    /* The target server must be a standby */
    if (!server_is_in_recovery(conn))
    {
        pg_log_error("target server must be a standby");
        disconnect_database(conn, true);
    }

Regards,
Vignesh