Thomas Munro <thomas.mu...@gmail.com> writes: > On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <t...@sss.pgh.pa.us> wrote: >> Yeah, I've been wondering whether pg_ctl could fork off a subprocess >> that would fork the postmaster, wait for the postmaster to exit, and then >> report the exit status. Where to report it *to* seems like the hard part, >> but maybe an answer that worked for the buildfarm would be enough for now.
> Oh, right, you don't even need subreaper tricks (I was imagining we > had a double fork somewhere we don't). I got around to looking at how to do this. Seeing that chipmunk hasn't failed again, I'm inclined to write that off as perhaps unrelated. That leaves us to diagnose the pg_upgrade failures on wobbegong and vulpes. The pg_upgrade test uses pg_ctl to start the postmaster, and the only simple way to wedge this requirement into pg_ctl is as attached. Now, the attached is completely *not* suitable as a permanent patch, because it degrades or breaks a number of pg_ctl behaviors that rely on knowing the postmaster's actual PID rather than that of the parent shell. But it gets through check-world, so I think we can stick it in transiently to see what it can teach us about the buildfarm failures. Given wobbegong's recent failure rate, I don't think we'll have to wait long. Some notes about the patch: * The core idea is to change start_postmaster's shell invocation so that the shell doesn't just exec the postmaster, but runs a mini shell script that runs the postmaster and then reports its exit status. I found that this still needed a dummy exec to force the shell to perform the I/O redirections on itself, else pg_ctl's TAP tests fail. (I think what was happening was that if the shell continued to hold open its original stdin, IPC::Run didn't believe the command was done.) * This means that what start_postmaster returns is not the postmaster's own PID, but that of the parent shell. So we have to lobotomize wait_for_postmaster to handle the PID the same way as on Windows (where that was already true); it can't test for exact equality between the child process PID and what's in postmaster.pid. (trap_sigint_during_startup is also broken, but we don't need that to work to get through the regression tests.) * That makes recovery/t/017_shm.pl fail, because there's a race condition: after killing the old postmaster, the existing postmaster.pid is enough to fool "pg_ctl start" into thinking the new postmaster is already running. I fixed that by making pg_ctl reject any PID seen in a pre-existing postmaster.pid file. That has a nonzero probability of false match, so I would not want to stick it in as a permanent thing on Unix ... but I wonder if it wouldn't be an improvement over the current situation on Windows. regards, tom lane
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index dd76be6..316651c 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -106,6 +106,7 @@ static char promote_file[MAXPGPATH]; static char logrotate_file[MAXPGPATH]; static volatile pgpid_t postmasterPID = -1; +static pgpid_t old_postmaster_pid = 0; #ifdef WIN32 static DWORD pgctl_start_type = SERVICE_AUTO_START; @@ -490,16 +491,17 @@ start_postmaster(void) /* * Since there might be quotes to handle here, it is easier simply to pass - * everything to a shell to process them. Use exec so that the postmaster - * has the same PID as the current child process. + * everything to a shell to process them. + * + * Since we aren't telling the shell to directly exec the postmaster, + * the returned PID is a parent process, the same as on Windows. */ if (log_file != NULL) - snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1", - exec_path, pgdata_opt, post_opts, - DEVNULL, log_file); + snprintf(cmd, MAXPGPATH, "exec < \"%s\" >> \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?", + DEVNULL, log_file, exec_path, pgdata_opt, post_opts); else - snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1", - exec_path, pgdata_opt, post_opts, DEVNULL); + snprintf(cmd, MAXPGPATH, "exec < \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?", + DEVNULL, exec_path, pgdata_opt, post_opts); (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL); @@ -586,12 +588,8 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint) pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]); pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]); if (pmstart >= start_time - 2 && -#ifndef WIN32 - pmpid == pm_pid -#else - /* Windows can only reject standalone-backend PIDs */ - pmpid > 0 -#endif + /* If pid is the value we saw before starting, assume it's stale */ + pmpid > 0 && pmpid != old_postmaster_pid ) { /* @@ -621,7 +619,7 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint) * Check whether the child postmaster process is still alive. This * lets us exit early if the postmaster fails during startup. * - * On Windows, we may be checking the postmaster's parent shell, but + * We may be checking the postmaster's parent shell, but * that's fine for this purpose. */ #ifndef WIN32 @@ -823,13 +821,12 @@ do_init(void) static void do_start(void) { - pgpid_t old_pid = 0; pgpid_t pm_pid; if (ctl_command != RESTART_COMMAND) { - old_pid = get_pgpid(false); - if (old_pid != 0) + old_postmaster_pid = get_pgpid(false); + if (old_postmaster_pid != 0) write_stderr(_("%s: another server might be running; " "trying to start server anyway\n"), progname);