Re: [HACKERS] bgworker crashed or not?

Petr Jelinek Wed, 07 May 2014 07:26:36 -0700

On 07/05/14 02:25, Petr Jelinek wrote:

On 06/05/14 19:05, Robert Haas wrote:

What I'm inclined to do is change the logic so that:


(1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so
that anything which is still registered gets restarted immediately.


Yes, that's quite obvious change which I missed completely :).

(2) If a shmem-connected backend fails to release the deadman switch
or exits with an exit code other than 0 or 1, we crash-and-restart.  A
non-shmem-connected backend never causes a crash-and-restart.

+1

(3) When a background worker exits without triggering a
crash-and-restart, an exit code of precisely 0 causes the worker to be
unregistered; any other exit code has no special effect, so
bgw_restart_time controls.

+1


Ok, attached patch is my try at the proposed changes.

I don't like the reset_bgworker_crash_state function name, maybe you'llcome up with something better...

I passes my tests for the desired behavior, hopefully I didn't miss somescenario.


--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index a6b25d8..e913395 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -472,13 +472,13 @@ bgworker_quickdie(SIGNAL_ARGS)
 	on_exit_reset();
 
 	/*
-	 * Note we do exit(0) here, not exit(2) like quickdie.  The reason is that
-	 * we don't want to be seen this worker as independently crashed, because
-	 * then postmaster would delay restarting it again afterwards.  If some
-	 * idiot DBA manually sends SIGQUIT to a random bgworker, the "dead man
-	 * switch" will ensure that postmaster sees this as a crash.
+	 * Note we do exit(2) here, not exit(0) for same reasons quickdie does.
+	 * Another reason is that bgworker is not restarted after exit(0).
+	 *
+	 * The bgw_restart_time does not apply here and the bgworker will be
+	 * restarted immediately.
 	 */
-	exit(0);
+	exit(2);
 }
 
 /*
@@ -832,8 +832,9 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
  * running but is no longer.
  *
  * In the latter case, the worker may be stopped temporarily (if it is
- * configured for automatic restart, or if it exited with code 0) or gone
- * for good (if it is configured not to restart and exited with code 1).
+ * configured for automatic restart and exited with code 1) or gone for
+ * good (if it exited with code other than 1 or if it is configured not
+ * to restart).
  */
 BgwHandleStatus
 GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pidp)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6d09887..99d5e85 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -396,6 +396,7 @@ static void TerminateChildren(int signal);
 
 static int	CountChildren(int target);
 static int	CountUnconnectedWorkers(void);
+static void reset_bgworker_crash_state(void);
 static void maybe_start_bgworker(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(AuxProcType type);
@@ -2616,7 +2617,10 @@ reaper(SIGNAL_ARGS)
 			if (PgStatPID == 0)
 				PgStatPID = pgstat_start();
 
-			/* some workers may be scheduled to start now */
+			/* reset the state of crashed bgworkers so their restart is not delayed */
+			reset_bgworker_crash_state();
+
+			/* workers may be scheduled to start now */
 			maybe_start_bgworker();
 
 			/* at this point we are really open for business */
@@ -2845,14 +2849,8 @@ CleanupBackgroundWorker(int pid,
 		snprintf(namebuf, MAXPGPATH, "%s: %s", _("worker process"),
 				 rw->rw_worker.bgw_name);
 
-		/* Delay restarting any bgworker that exits with a nonzero status. */
-		if (!EXIT_STATUS_0(exitstatus))
-			rw->rw_crashed_at = GetCurrentTimestamp();
-		else
-			rw->rw_crashed_at = 0;
-
 		/*
-		 * Additionally, for shared-memory-connected workers, just like a
+		 * For shared-memory-connected workers, just like a
 		 * backend, any exit status other than 0 or 1 is considered a crash
 		 * and causes a system-wide restart.
 		 */
@@ -2860,21 +2858,21 @@ CleanupBackgroundWorker(int pid,
 		{
 			if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
 			{
-				rw->rw_crashed_at = GetCurrentTimestamp();
+				rw->rw_crashed_at = 0;
 				HandleChildCrash(pid, exitstatus, namebuf);
 				return true;
 			}
 		}
 
-		if (!ReleasePostmasterChildSlot(rw->rw_child_slot))
-		{
-			/*
-			 * Uh-oh, the child failed to clean itself up.  Treat as a crash
-			 * after all.
-			 */
+		/* Mark any bgworker that exits with a nonzero status or fails to clean
+		 * up after itself as crashed and ready for restart at bgw_restart_time. */
+		if (!EXIT_STATUS_0(exitstatus) || !ReleasePostmasterChildSlot(rw->rw_child_slot))
 			rw->rw_crashed_at = GetCurrentTimestamp();
-			HandleChildCrash(pid, exitstatus, namebuf);
-			return true;
+		else
+		{
+			/* Exit status zero means terminate */
+			rw->rw_crashed_at = 0;
+			rw->rw_terminate = true;
 		}
 
 		/* Get it out of the BackendList and clear out remaining data */
@@ -5463,6 +5461,38 @@ assign_backendlist_entry(RegisteredBgWorker *rw)
 }
 
 /*
+ * Reset background worker crash state.
+ *
+ * Used after the crash-and-restart procedure to make sure
+ * all the background workers are started asap.
+ */
+static void
+reset_bgworker_crash_state(void)
+{
+	slist_mutable_iter	iter;
+
+	slist_foreach_modify(iter, &BackgroundWorkerList)
+	{
+		RegisteredBgWorker *rw;
+
+		rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur);
+
+		/*
+		 * We only care about background workers that are not running
+		 * and are not marked for removal
+		 */
+		if (rw->rw_pid != 0 || rw->rw_terminate)
+			continue;
+
+		/*
+		 * Reset the crash time so that subsequent maybe_start_bgworker
+		 * call will start the bgworker immediately
+		 */
+		rw->rw_crashed_at = 0;
+	}
+}
+
+/*
  * If the time is right, start one background worker.
  *
  * As a side effect, the bgworker control variables are set or reset whenever
diff --git a/src/include/postmaster/bgworker.h b/src/include/postmaster/bgworker.h
index c9550cc..d3dd1f2 100644
--- a/src/include/postmaster/bgworker.h
+++ b/src/include/postmaster/bgworker.h
@@ -16,10 +16,11 @@
  * that the failure can only be transient (fork failure due to high load,
  * memory pressure, too many processes, etc); more permanent problems, like
  * failure to connect to a database, are detected later in the worker and dealt
- * with just by having the worker exit normally.  A worker which exits with a
- * return code of 0 will be immediately restarted by the postmaster.  A worker
- * which exits with a return code of 1 will be restarted after the configured
- * restart interval, or never if that interval is set to BGW_NEVER_RESTART.
+ * with just by having the worker exit normally. A worker which exits with
+ * a return code of 0 will never be restarted and will be removed from worker
+ * list. A worker which exits with a return code of 1 will be restarted after
+ * the configured restart interval, or never if that interval is set to
+ * BGW_NEVER_RESTART.
  * The TerminateBackgroundWorker() function can be used to terminate a
  * dynamically registered background worker; the worker will be sent a SIGTERM
  * and will not be restarted after it exits.  Whenever the postmaster knows

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] bgworker crashed or not?

Reply via email to