Re: stress test for parallel workers

Tom Lane Sun, 06 Oct 2019 21:08:42 -0700

Thomas Munro <thomas.mu...@gmail.com> writes:
> On Wed, Aug 7, 2019 at 4:29 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>> that would fork the postmaster, wait for the postmaster to exit, and then
>> report the exit status.  Where to report it *to* seems like the hard part,
>> but maybe an answer that worked for the buildfarm would be enough for now.


> Oh, right, you don't even need subreaper tricks (I was imagining we
> had a double fork somewhere we don't).

I got around to looking at how to do this.  Seeing that chipmunk hasn't
failed again, I'm inclined to write that off as perhaps unrelated.
That leaves us to diagnose the pg_upgrade failures on wobbegong and
vulpes.  The pg_upgrade test uses pg_ctl to start the postmaster,
and the only simple way to wedge this requirement into pg_ctl is as
attached.  Now, the attached is completely *not* suitable as a permanent
patch, because it degrades or breaks a number of pg_ctl behaviors that
rely on knowing the postmaster's actual PID rather than that of the
parent shell.  But it gets through check-world, so I think we can stick it
in transiently to see what it can teach us about the buildfarm failures.
Given wobbegong's recent failure rate, I don't think we'll have to wait
long.

Some notes about the patch:

* The core idea is to change start_postmaster's shell invocation
so that the shell doesn't just exec the postmaster, but runs a
mini shell script that runs the postmaster and then reports its
exit status.  I found that this still needed a dummy exec to force
the shell to perform the I/O redirections on itself, else pg_ctl's
TAP tests fail.  (I think what was happening was that if the shell
continued to hold open its original stdin, IPC::Run didn't believe
the command was done.)

* This means that what start_postmaster returns is not the postmaster's
own PID, but that of the parent shell.  So we have to lobotomize
wait_for_postmaster to handle the PID the same way as on Windows
(where that was already true); it can't test for exact equality
between the child process PID and what's in postmaster.pid.
(trap_sigint_during_startup is also broken, but we don't need that
to work to get through the regression tests.)

* That makes recovery/t/017_shm.pl fail, because there's a race
condition: after killing the old postmaster, the existing
postmaster.pid is enough to fool "pg_ctl start" into thinking the new
postmaster is already running.  I fixed that by making pg_ctl reject
any PID seen in a pre-existing postmaster.pid file.  That has a
nonzero probability of false match, so I would not want to stick it
in as a permanent thing on Unix ... but I wonder if it wouldn't be
an improvement over the current situation on Windows.

                        regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index dd76be6..316651c 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -106,6 +106,7 @@ static char promote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pgpid_t postmasterPID = -1;
+static pgpid_t old_postmaster_pid = 0;
 
 #ifdef WIN32
 static DWORD pgctl_start_type = SERVICE_AUTO_START;
@@ -490,16 +491,17 @@ start_postmaster(void)
 
 	/*
 	 * Since there might be quotes to handle here, it is easier simply to pass
-	 * everything to a shell to process them.  Use exec so that the postmaster
-	 * has the same PID as the current child process.
+	 * everything to a shell to process them.
+	 *
+	 * Since we aren't telling the shell to directly exec the postmaster,
+	 * the returned PID is a parent process, the same as on Windows.
 	 */
 	if (log_file != NULL)
-		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
-				 exec_path, pgdata_opt, post_opts,
-				 DEVNULL, log_file);
+		snprintf(cmd, MAXPGPATH, "exec < \"%s\" >> \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?",
+				 DEVNULL, log_file, exec_path, pgdata_opt, post_opts);
 	else
-		snprintf(cmd, MAXPGPATH, "exec \"%s\" %s%s < \"%s\" 2>&1",
-				 exec_path, pgdata_opt, post_opts, DEVNULL);
+		snprintf(cmd, MAXPGPATH, "exec < \"%s\" 2>&1; \"%s\" %s%s; echo postmaster exit status is $?",
+				 DEVNULL, exec_path, pgdata_opt, post_opts);
 
 	(void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
 
@@ -586,12 +588,8 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint)
 			pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
 			pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
 			if (pmstart >= start_time - 2 &&
-#ifndef WIN32
-				pmpid == pm_pid
-#else
-			/* Windows can only reject standalone-backend PIDs */
-				pmpid > 0
-#endif
+			/* If pid is the value we saw before starting, assume it's stale */
+				pmpid > 0 && pmpid != old_postmaster_pid
 				)
 			{
 				/*
@@ -621,7 +619,7 @@ wait_for_postmaster(pgpid_t pm_pid, bool do_checkpoint)
 		 * Check whether the child postmaster process is still alive.  This
 		 * lets us exit early if the postmaster fails during startup.
 		 *
-		 * On Windows, we may be checking the postmaster's parent shell, but
+		 * We may be checking the postmaster's parent shell, but
 		 * that's fine for this purpose.
 		 */
 #ifndef WIN32
@@ -823,13 +821,12 @@ do_init(void)
 static void
 do_start(void)
 {
-	pgpid_t		old_pid = 0;
 	pgpid_t		pm_pid;
 
 	if (ctl_command != RESTART_COMMAND)
 	{
-		old_pid = get_pgpid(false);
-		if (old_pid != 0)
+		old_postmaster_pid = get_pgpid(false);
+		if (old_postmaster_pid != 0)
 			write_stderr(_("%s: another server might be running; "
 						   "trying to start server anyway\n"),
 						 progname);

Re: stress test for parallel workers

Reply via email to