On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > On Fri, Apr 5, 2024 at 10:31 AM shveta malik <shveta.ma...@gmail.com> wrote: > > > > Please find v2. Changes are: > > Thanks for the patch. Here are some comments.
Thanks for reviewing. > > 1. Can we have a clear saying in the shmem variable who's syncing at > the moment? Is it a slot sync worker or a backend via SQL function? > Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;" > > typedef enum SlotSyncSource > { > SLOT_SYNC_NONE, > SLOT_SYNC_WORKER, > SLOT_SYNC_BACKEND, > } SlotSyncSource; > > Then, the check in ShutDownSlotSync can be: > > + /* > + * Return if neither the slot sync worker is running nor the function > + * pg_sync_replication_slots() is executing. > + */ > + if ((SlotSyncCtx->pid == InvalidPid) && > SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND) > { > > 2. > SyncReplicationSlots(WalReceiverConn *wrconn) > { > + /* > + * Startup process signaled the slot sync to stop, so if meanwhile user > + * has invoked slot sync SQL function, simply return. > + */ > + SpinLockAcquire(&SlotSyncCtx->mutex); > + if (SlotSyncCtx->stopSignaled) > + { > + ereport(LOG, > + errmsg("skipping slot synchronization as slot sync > shutdown is signaled during promotion")); > + > > Unless I'm missing something, I think this can't detect if the backend > via SQL function is already half-way through syncing in > synchronize_one_slot. So, better move this check to (or also have it > there) slot sync loop that calls synchronize_one_slot. To avoid > spinlock acquisitions, we can perhaps do this check in when we acquire > the spinlock for synced flag. If the sync via SQL function is already half-way, then promotion should wait for it to finish. I don't think it is a good idea to move the check to synchronize_one_slot(). The sync-call should either not start (if it noticed the promotion) or finish the sync and then let promotion proceed. But I would like to know others' opinion on this. > > /* Search for the named slot */ > if ((slot = SearchNamedReplicationSlot(remote_slot->name, true))) > { > bool synced; > > SpinLockAcquire(&slot->mutex); > synced = slot->data.synced; > << get SlotSyncCtx->stopSignaled here >> > SpinLockRelease(&slot->mutex); > > << do the slot sync skip check here if (stopSignaled) >> > > 3. Can we have a test or steps at least to check the consequences > manually one can get if slot syncing via SQL function is happening > during the promotion? > > IIUC, we need to ensure there is no backend acquiring it and > performing sync while the slot sync worker is shutting down/standby > promotion is occuring. Otherwise, some of the slots can get resynced > and some are not while we are shutting down the slot sync worker as > part of the standby promotion which might leave the slots in an > inconsistent state. I do not think that we can reach a state (exception is some error scenario) where some of the slots are synced while the rest are not during a *particular* sync-cycle only because promotion is going in parallel. (And yes we need to fix the race-condition stated by Amit up-thread for this statement to be true.) thanks Shveta