Hi,

On Wed, 15 Apr 2026 at 23:31, Ayush Tiwari <[email protected]> wrote:
> Hi
>
> On Wed, 15 Apr 2026 at 22:21, Heikki Linnakangas <[email protected]> wrote:
>
>> On 15/04/2026 16:57, Ayush Tiwari wrote:
>> > Hi,
>> >
>> > The comment above the PM_STARTUP startup-process-failure case still says
>> > that there are no other processes running yet, so the postmaster can just
>> > exit.
>> >
>> > That no longer matches the current startup flow: PM_STARTUP may already
>> > have auxiliary processes running by that point. The attached patch updates
>> > that comment to describe the current behavior.
>>
>> Hmm, shouldn't the postmaster kill and wait for the auxiliary processes
>> to exit first in that case? ISTM we need code changes here, not just
>> comments.
>>
>> - Heikki
>>
>
> Yes, I agree, code change is required here.
>
> The proper thing is to
> route this through the existing crash-handling path so the postmaster
> SIGQUITs the aux children and waits for them to exit before terminating.
>
> I think the minimal change is:
>
> 1. Replace the ExitPostmaster(1) shortcut in the PM_STARTUP
>    startup-failure case with HandleChildCrash(), which calls
>    TerminateChildren(SIGQUIT) and transitions through the state
>    machine. Set StartupStatus = STARTUP_CRASHED so the state
>    machine does not try to reinitialize.
>
> 2. Let HandleFatalError() handle PM_STARTUP by transitioning to
>    PM_WAIT_BACKENDS, instead of the current Assert(false).

The minimal fix turned out to be smaller than I first described: the
existing stanza immediately below the ExitPostmaster(1) block already
handles !EXIT_STATUS_0 with StartupStatus != STARTUP_SIGNALED correctly
(it sets STARTUP_CRASHED and calls HandleChildCrash()). So the likely fix
would be:

1. Delete the PM_STARTUP ExitPostmaster(1) shortcut and let execution
   fall through to the next stanza.

2. Replace the Assert(false) for PM_STARTUP in HandleFatalError() with a
   fall-through to UpdatePMState(PM_WAIT_BACKENDS).
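To make the intended behavior change concrete, here is a toy model of the two paths. It is only a sketch and is not postmaster code: the PM_* state names come from this thread, but ToyPostmaster and every toy_* function are hypothetical names invented for illustration, and the bookkeeping is far simpler than the real state machine.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only. State names mirror this thread; everything else
 * (struct, fields, toy_* functions) is hypothetical. */
typedef enum
{
    PM_STARTUP,
    PM_WAIT_BACKENDS,
    PM_WAIT_DEAD_END,
    PM_NO_CHILDREN
} ToyPMState;

typedef struct
{
    ToyPMState  state;
    int         live_children;  /* aux processes still running */
    int         sigquits_sent;  /* one per child in this toy */
    bool        exited;
    int         exit_code;
} ToyPostmaster;

/* Current behavior: exit right away, children are left running. */
static void
toy_startup_failed_old(ToyPostmaster *pm)
{
    assert(pm->state == PM_STARTUP);
    pm->exited = true;
    pm->exit_code = 1;
}

/* Proposed behavior: signal the children, then wait for them. */
static void
toy_startup_failed_new(ToyPostmaster *pm)
{
    assert(pm->state == PM_STARTUP);
    pm->sigquits_sent += pm->live_children;
    pm->state = PM_WAIT_BACKENDS;
}

/* Called once per reaped child; exit only when none are left. */
static void
toy_child_exited(ToyPostmaster *pm)
{
    assert(!pm->exited && pm->live_children > 0);
    pm->live_children--;
    if (pm->state == PM_WAIT_BACKENDS && pm->live_children == 0)
    {
        /* No dead-end children in this toy, so this step is a no-op. */
        pm->state = PM_WAIT_DEAD_END;
        pm->state = PM_NO_CHILDREN;
        pm->exited = true;
        pm->exit_code = 1;
    }
}
```

In this model the old path ends with exited set while live_children is still nonzero (the orphaned children), whereas the new path can only set exited after live_children has reached zero.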
Verification that I did for the patch:

On a freshly initdb'd cluster, I zeroed out the first WAL segment to
force the startup process to FATAL at StartupXLOG, then ran PG in the
foreground under strace.

Before (master):

    LOG: startup process (PID N) exited with exit code 1
    LOG: aborting startup due to startup process failure
    LOG: database system is shut down

strace of the postmaster PID shows 0 kill() calls to children before
exit_group(1). Checkpointer, bgwriter and io workers were running at the
time of the failure and were orphaned.

After (patched):

    LOG: startup process (PID N) exited with exit code 1
    LOG: terminating any other active server processes
    <state transitions PM_STARTUP -> PM_WAIT_BACKENDS -> PM_WAIT_DEAD_END -> PM_NO_CHILDREN>
    LOG: shutting down due to startup process failure
    LOG: database system is shut down

strace shows 8 SIGQUIT deliveries (4 children, each signaled by PID and
by process-group) before the postmaster's own exit_group(1).

I've attached a patch, please review and let me know your thoughts.

Regards,
Ayush
0001-postmaster-drain-aux-processes-on-startup-process-fa.patch
