On Thu, Mar 26, 2026 at 4:08 PM Nisha Moond <[email protected]> wrote:
>
> On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <[email protected]> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
> > >
> > > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> 
> > > wrote:
> > > >
> > > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> 
> > > > wrote:
> > > > >
> > > > > I noticed that during standby promotion the startup process sends 
> > > > > SIGUSR1 to
> > > > > the slotsync worker to make it exit. Is there a reason for using 
> > > > > SIGUSR1?
> > > > >
> > > >
> > > > IIRC, this same signal is used for both the backend executing
> > > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > > exit and error_out backend. Using SIGTERM for backend could result in
> > > > its exit.
> > >
> > > Why do we want the backend running pg_sync_replication_slots() to throw
> > > an error here, rather than just exit? If emitting an error is really 
> > > required,
> > > another option would be to store the process type in SlotSyncCtx and send
> > > different signals accordingly, for example, SIGTERM for the slotsync 
> > > worker
> > > and another signal for a backend. But it seems simpler and sufficient to 
> > > have
> > > the backend exit in this case as well.
> > >
> >
> > As we want to retain the existing behavior for API, so instead of
> > using two signals, we can achieve what you intend to achieve by one
> > signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> > ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> > slotsync worker/backend via SendProcSignal. Then in
> > procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> > HandleSlotSyncInterrupt() will set the InterruptPending and
> > SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> > specific function based on the flag and do what we currently do in
> > ProcessSlotSyncInterrupts. I think this should address the issue you
> > are worried about.
> >
>
> +1
> Retaining the current behavior for the API backend keeps it consistent
> with other backends that continue after promotion.
>
> In the reproduced case, the worker (or API backend) is waiting in:
> libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
> When SIGUSR1 is received, it only sets the latch but does not mark any
> interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
> effectively a no-op, and the process goes back to waiting. So, control
> never returns to the slotsync code path, and we cannot rely on
> stopSignaled to handle exit/error separately.
> Only SIGTERM works here because its handler sets
> INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
> break the loop. The other signals like SIGUSR1 or SIGINT do not do
> this, so simply using another signal might not solve the API error
> handling case.
>
> I’ve implemented the above approach suggested by Amit in the attached
> patch and verified it for both worker and API scenarios. With this,
> the API can now error-out without exiting the backend.
>

+1 on the idea. Few comments:

1)
It was not clear initially as to why SetLatch is not done in
HandleSlotSyncShutdownInterrupt(), digging it further revealed that
procsignal_sigusr1_handler() will do SetLatch outside. Perhaps you can
add below comment at the end of HandleSlotSyncShutdownInterrupt()
similar to how other functions (HandleProcSignalBarrierInterrupt,
HandleRecoveryConflictInterrupt etc) do.

/* latch will be set by procsignal_sigusr1_handler */

2)
In ProcessSlotSyncInterrupts(), now we don't need the below logic right?

if (SlotSyncCtx->stopSignaled)
    {
        if (AmLogicalSlotSyncWorkerProcess())
        {
            ...
            proc_exit(0);
        }
        else
        {
            /*
             * For the backend executing SQL function
             * pg_sync_replication_slots().
             */
            ereport(ERROR,
                    errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                    errmsg("replication slot synchronization will stop
because promotion is triggered"));
        }
    }

thanks
Shveta


Reply via email to