On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <[email protected]> wrote:
>
> On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
> >
> > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> > >
> > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > > >
> > > > I noticed that during standby promotion the startup process sends 
> > > > SIGUSR1 to
> > > > the slotsync worker to make it exit. Is there a reason for using 
> > > > SIGUSR1?
> > > >
> > >
> > > IIRC, this same signal is used for both the backend executing
> > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > exit and error_out backend. Using SIGTERM for backend could result in
> > > its exit.
> >
> > Why do we want the backend running pg_sync_replication_slots() to throw
> > an error here, rather than just exit? If emitting an error is really 
> > required,
> > another option would be to store the process type in SlotSyncCtx and send
> > different signals accordingly, for example, SIGTERM for the slotsync worker
> > and another signal for a backend. But it seems simpler and sufficient to 
> > have
> > the backend exit in this case as well.
> >
>
> As we want to retain the existing behavior for API, so instead of
> using two signals, we can achieve what you intend to achieve by one
> signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> slotsync worker/backend via SendProcSignal. Then in
> procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> HandleSlotSyncInterrupt() will set the InterruptPending and
> SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> specific function based on the flag and do what we currently do in
> ProcessSlotSyncInterrupts. I think this should address the issue you
> are worried about.
>

+1
Retaining the current behavior for the API backend keeps it consistent
with other backends that continue after promotion.

In the reproduced case, the worker (or API backend) is waiting in:
libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
When SIGUSR1 is received, it only sets the latch but does not mark any
interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
effectively a no-op, and the process goes back to waiting. So, control
never returns to the slotsync code path, and we cannot rely on
stopSignaled to handle exit/error separately.
Only SIGTERM works here because its handler sets
INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
break the loop. The other signals like SIGUSR1 or SIGINT do not do
this, so simply using another signal might not solve the API error
handling case.

I’ve implemented the above approach suggested by Amit in the attached
patch and verified it for both worker and API scenarios. With this,
the API can now error-out without exiting the backend.

--
Thanks,
Nisha

Attachment: v2-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch
Description: Binary data

Reply via email to