On Mon, 25 Mar 2024 at 21:36, Hayato Kuroda (Fujitsu)
<kuroda.hay...@fujitsu.com> wrote:
>
> Dear Bharath, Peter,
>
> > Looks like BF animals aren't happy, please check -
> > > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> >
> > Looks like sanitizer failures.  There were a few messages about that
> > recently, but those were all just about freeing memory after use, which
> > we don't necessarily require for client programs.  So maybe something else.
>
> It seems that there are several time of failures, [1] and [2].
>
> ## Analysis for failure 1
>
> The failure caused by a time lag between walreceiver finishes and 
> pg_is_in_recovery()
> returns true.
>
> According to the output [1], it seems that the tool failed at 
> wait_for_end_recovery()
> with the message "standby server disconnected from the primary". Also, lines
> "redo done at..." and "terminating walreceiver process due to administrator 
> command"
> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
>
> According to the source, we confirm that walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, 
> SharedRecoveryState
> is changed to RECOVERY_STATE_DONE (this meant the pg_is_in_recovery() return 
> true)
> at the latter part of StartupXLOG().
>
> So, if there is a delay between FinishWalRecovery() and change the state, the 
> check
> in wait_for_end_recovery() would be failed during the time. Since we allow to 
> miss
> the walreceiver 10 times and it is checked once per second, the failure 
> occurs if
> the time lag is longer than 10 seconds.
>
> I do not have a good way to fix it. One approach is make NUM_CONN_ATTEMPTS 
> larger,
> but it's not a fundamental solution.

I agree with your analysis, another way to fix could be to remove the
following check as increasing the count might still have the race
condition issue:
/*
* If it is still in recovery, make sure the target server is
* connected to the primary so it can receive the required WAL to
* finish the recovery process. If it is disconnected try
* NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
*/
res = PQexec(conn,
"SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");

I'm not sure whether we should worry about the condition where
recovery is not done and pg_stat_wal_receiver is exited as we have the
following sanity check in check_subscriber before we wait for recovery
to be finished:
/* The target server must be a standby */
if (!server_is_in_recovery(conn))
{
pg_log_error("target server must be a standby");
disconnect_database(conn, true);
}

Regards,
Vignesh


Reply via email to