On Wed, Mar 26, 2025 at 05:17:17PM +0300, Maksim.Melnikov wrote: > Test check visibility of prepared transactions in standby after a restart > while primary is down. > Failed assert check that changes, commited on transaction on primary node, > were synchronously replicated to standby node. > In test flow we have primary server stop/start events. > Assert failure is hard to reproduce, but I found how to do it synthetically, > can do it by injecting in code synthetical sleep call > > --- a/src/backend/replication/syncrep.c > +++ b/src/backend/replication/syncrep.c > @@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void) > > if (sync_standbys_defined != WalSndCtl->sync_standbys_defined) > { > + sleep(2); > LWLockAcquire(SyncRepLock, LW_EXCLUSIVE); > > /*
Nice investigation to find out that this specific path faces a race here. Tests on Windows are usually slower and good at catching these. It may be possible that the buildfarm has reported that, actually. > So my question is > can we accept such behavior as bug of postgres codebase, or, on other way, > it is bug of unstable test? > P.S. > For me it is looking strange, that postgres start processing workload before > full config initializing. > Attached possible patch for codebase. Yeah, I agree that it is a bug, and your approach to be more careful in SyncRepWaitForLSN() so as we wait until the configuration is loaded and set from the checkpointer sounds like the sensible thing to do, because the backends have no idea yet what they should do as the configuration is not loaded. Tracking if the configuration has been properly loaded in WalSndCtlData makes sense here, but I am confused by the patch: you have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets it. It may be simpler to use a separate boolean flag for this purpose. WalSndCtlData is a private structure holding the state of the WAL senders internally, so ABI does not stress me much here (flexible array at the end of the structure, as well..). Also mentioned by Kirill downthread: let's add a regression test with an injection point that does a wait. This race condition is worth having a test for. You could define the point of where you have added your hardcoded sleep(), for example. -- Michael
signature.asc
Description: PGP signature