On Tue, Mar 31, 2026 at 4:12 PM shveta malik <[email protected]> wrote:
>
> On Tue, Mar 31, 2026 at 11:35 AM Nisha Moond <[email protected]> wrote:
> >
> > On Mon, Mar 30, 2026 at 4:39 PM Fujii Masao <[email protected]> wrote:
> > >
> > > On Mon, Mar 30, 2026 at 1:18 PM Nisha Moond <[email protected]> wrote:
> > > > We were using the same log message in two places:
> > > > check_and_set_sync_info() and HandleSlotSyncMessage().
> > > > I think “will not start” fits better in the first case, while “will
> > > > stop” makes sense to keep in the second.
> > >
> > > Thanks for updating the patch!
> > >
> > > With the patch, in my testing, standby promotion always produces
> > > the following logs:
> > >
> > > LOG: replication slot synchronization worker will stop because promotion is triggered
> > > LOG: replication slot synchronization worker will not start because promotion was triggered
> > >
> > > It looks like the postmaster immediately restarts the slotsync worker
> > > after promotion terminates it, and that new worker then exits on seeing
> > > SlotSyncCtx->stopSignaled.
> > >
> > > IMO, always emitting both messages is a bit confusing. It would be nice to
> > > suppress the second one if possible.
> > >
> > > One idea would be to prevent the restart altogether. For example,
> > > ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> > > a special value (like -1), and SlotSyncWorkerCanRestart() could return
> > > false (i.e., prevent postmaster from starting up slotsync worker) when
> > > it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> > > check SlotSyncCtx->stopSignaled.
> > >
> > > That said, as far as I remember, postmaster is generally not
> > > supposed to touch shared memory (per the comments in postmaster.c),
> > > so I'm not sure this approach is acceptable. On the other hand,
> > > postmaster and the slotsync worker already rely on
> > > SlotSyncCtx->last_start_time, so perhaps there's some precedent here.
> > >
> >
> > IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
> > may not be ideal, as it requires a spinlock to avoid races with the
> > startup process and it is disallowed to take lock in postmaster main
> > loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
> > the postmaster accesses it only when the worker is not alive.
> >
>
> I agree.
>
> > Another option could be to log in check_and_set_sync_info() at DEBUG1
> > instead of LOG level. This message appears only after stopSignaled is
> > set, when promotion is already in progress and the first worker has
> > logged “will stop…”. The second worker doesn’t do any real work. Since
> > there’s nothing actionable for users, using DEBUG1 would keep it
> > useful for debugging (e.g., noticing immediate restarts) while
> > avoiding extra log noise. Thoughts?
> >
>
> +1.
>
> Do you think we can slightly tweak the comment atop the file to:
>
> On promotion the startup process sets 'stopSignaled' and uses this
> 'pid' to signal synchronizing process with PROCSIG_SLOTSYNC_MESSAGE
> and also to wake it up so that the process can immediately stop its
> synchronizing work. Setting 'stopSignaled' on the other hand is used
> to handle the race condition....
>
Done.

> Also shall we quick exit ProcessSlotSyncMessage() if syncing is
> already finished by API?
>

Makes sense. Fixed.

Please find the updated patch (v7) attached.

--
Thanks,
Nisha
v7-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch
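To make the restart-gating idea quoted above easier to follow, here is a minimal standalone sketch of a sentinel in last_start_time that blocks further restarts. It is only a model under stated assumptions, not code from the patch or from PostgreSQL: the struct, the function names, the MODEL_NO_RESTART value, and the 10-second interval are all hypothetical stand-ins, and the real code lives in shared memory with its own locking rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical sentinel meaning "promotion was signaled, do not restart". */
#define MODEL_NO_RESTART ((time_t) -1)

/* Assumed minimum gap between two worker launches, in seconds. */
#define MODEL_RESTART_INTERVAL_SEC 10

/* Hypothetical stand-in for the shared slotsync state. */
typedef struct ModelSlotSyncCtx
{
    time_t last_start_time;     /* 0 = never started, MODEL_NO_RESTART = blocked */
} ModelSlotSyncCtx;

/*
 * Worker side: models what processing the stop message could do under the
 * sentinel proposal. The worker is still alive here, so in this model the
 * postmaster is not reading the field at the same time.
 */
void
model_on_stop_message(ModelSlotSyncCtx *ctx)
{
    assert(ctx != NULL);
    ctx->last_start_time = MODEL_NO_RESTART;
}

/*
 * Postmaster side: models the restart check. It is meant to be called only
 * when no worker is alive, which is why the sketch takes no lock.
 */
bool
model_worker_can_restart(ModelSlotSyncCtx *ctx, time_t now)
{
    assert(ctx != NULL);

    /* Promotion already stopped the worker: never launch a new one. */
    if (ctx->last_start_time == MODEL_NO_RESTART)
        return false;

    /* Rate-limit ordinary restarts after a crash or exit. */
    if (ctx->last_start_time != 0 &&
        difftime(now, ctx->last_start_time) < MODEL_RESTART_INTERVAL_SEC)
        return false;

    ctx->last_start_time = now;
    return true;
}
```

In this model the second worker is never launched after promotion, so the "will not start" message would not be emitted at all. Logging it at DEBUG1 instead, as discussed in the thread, leaves the restart behavior unchanged and only lowers the message's visibility.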
