On 07/05/14 02:25, Petr Jelinek wrote:
On 06/05/14 19:05, Robert Haas wrote:
What I'm inclined to do is change the logic so that:
(1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so
that anything which is still registered gets restarted immediately.
Yes, that's quite obvious change which I missed completely :).
(2) If a shmem-connected backend fails to release the deadman switch
or exits with an exit code other than 0 or 1, we crash-and-restart. A
non-shmem-connected backend never causes a crash-and-restart.
+1
(3) When a background worker exits without triggering a
crash-and-restart, an exit code of precisely 0 causes the worker to be
unregistered; any other exit code has no special effect, so
bgw_restart_time controls.
+1
Ok, attached patch is my try at the proposed changes.
I don't like the reset_bgworker_crash_state function name, maybe you'll
come up with something better...
I passes my tests for the desired behavior, hopefully I didn't miss some
scenario.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index a6b25d8..e913395 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -472,13 +472,13 @@ bgworker_quickdie(SIGNAL_ARGS)
on_exit_reset();
/*
- * Note we do exit(0) here, not exit(2) like quickdie. The reason is that
- * we don't want to be seen this worker as independently crashed, because
- * then postmaster would delay restarting it again afterwards. If some
- * idiot DBA manually sends SIGQUIT to a random bgworker, the "dead man
- * switch" will ensure that postmaster sees this as a crash.
+ * Note we do exit(2) here, not exit(0) for same reasons quickdie does.
+ * Another reason is that bgworker is not restarted after exit(0).
+ *
+ * The bgw_restart_time does not apply here and the bgworker will be
+ * restarted immediately.
*/
- exit(0);
+ exit(2);
}
/*
@@ -832,8 +832,9 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
* running but is no longer.
*
* In the latter case, the worker may be stopped temporarily (if it is
- * configured for automatic restart, or if it exited with code 0) or gone
- * for good (if it is configured not to restart and exited with code 1).
+ * configured for automatic restart and exited with code 1) or gone for
+ * good (if it exited with code other than 1 or if it is configured not
+ * to restart).
*/
BgwHandleStatus
GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pidp)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6d09887..99d5e85 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -396,6 +396,7 @@ static void TerminateChildren(int signal);
static int CountChildren(int target);
static int CountUnconnectedWorkers(void);
+static void reset_bgworker_crash_state(void);
static void maybe_start_bgworker(void);
static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
@@ -2616,7 +2617,10 @@ reaper(SIGNAL_ARGS)
if (PgStatPID == 0)
PgStatPID = pgstat_start();
- /* some workers may be scheduled to start now */
+ /* reset the state of crashed bgworkers so their restart is not delayed */
+ reset_bgworker_crash_state();
+
+ /* workers may be scheduled to start now */
maybe_start_bgworker();
/* at this point we are really open for business */
@@ -2845,14 +2849,8 @@ CleanupBackgroundWorker(int pid,
snprintf(namebuf, MAXPGPATH, "%s: %s", _("worker process"),
rw->rw_worker.bgw_name);
- /* Delay restarting any bgworker that exits with a nonzero status. */
- if (!EXIT_STATUS_0(exitstatus))
- rw->rw_crashed_at = GetCurrentTimestamp();
- else
- rw->rw_crashed_at = 0;
-
/*
- * Additionally, for shared-memory-connected workers, just like a
+ * For shared-memory-connected workers, just like a
* backend, any exit status other than 0 or 1 is considered a crash
* and causes a system-wide restart.
*/
@@ -2860,21 +2858,21 @@ CleanupBackgroundWorker(int pid,
{
if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
{
- rw->rw_crashed_at = GetCurrentTimestamp();
+ rw->rw_crashed_at = 0;
HandleChildCrash(pid, exitstatus, namebuf);
return true;
}
}
- if (!ReleasePostmasterChildSlot(rw->rw_child_slot))
- {
- /*
- * Uh-oh, the child failed to clean itself up. Treat as a crash
- * after all.
- */
+ /* Mark any bgworker that exits with a nonzero status or fails to clean
+ * up after itself as crashed and ready for restart at bgw_restart_time. */
+ if (!EXIT_STATUS_0(exitstatus) || !ReleasePostmasterChildSlot(rw->rw_child_slot))
rw->rw_crashed_at = GetCurrentTimestamp();
- HandleChildCrash(pid, exitstatus, namebuf);
- return true;
+ else
+ {
+ /* Exit status zero means terminate */
+ rw->rw_crashed_at = 0;
+ rw->rw_terminate = true;
}
/* Get it out of the BackendList and clear out remaining data */
@@ -5463,6 +5461,38 @@ assign_backendlist_entry(RegisteredBgWorker *rw)
}
/*
+ * Reset background worker crash state.
+ *
+ * Used after the crash-and-restart procedure to make sure
+ * all the background workers are started asap.
+ */
+static void
+reset_bgworker_crash_state(void)
+{
+ slist_mutable_iter iter;
+
+ slist_foreach_modify(iter, &BackgroundWorkerList)
+ {
+ RegisteredBgWorker *rw;
+
+ rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur);
+
+ /*
+ * We only care about background workers that are not running
+ * and are not marked for removal
+ */
+ if (rw->rw_pid != 0 || rw->rw_terminate)
+ continue;
+
+ /*
+ * Reset the crash time so that subsequent maybe_start_bgworker
+ * call will start the bgworker immediately
+ */
+ rw->rw_crashed_at = 0;
+ }
+}
+
+/*
* If the time is right, start one background worker.
*
* As a side effect, the bgworker control variables are set or reset whenever
diff --git a/src/include/postmaster/bgworker.h b/src/include/postmaster/bgworker.h
index c9550cc..d3dd1f2 100644
--- a/src/include/postmaster/bgworker.h
+++ b/src/include/postmaster/bgworker.h
@@ -16,10 +16,11 @@
* that the failure can only be transient (fork failure due to high load,
* memory pressure, too many processes, etc); more permanent problems, like
* failure to connect to a database, are detected later in the worker and dealt
- * with just by having the worker exit normally. A worker which exits with a
- * return code of 0 will be immediately restarted by the postmaster. A worker
- * which exits with a return code of 1 will be restarted after the configured
- * restart interval, or never if that interval is set to BGW_NEVER_RESTART.
+ * with just by having the worker exit normally. A worker which exits with
+ * a return code of 0 will never be restarted and will be removed from worker
+ * list. A worker which exits with a return code of 1 will be restarted after
+ * the configured restart interval, or never if that interval is set to
+ * BGW_NEVER_RESTART.
* The TerminateBackgroundWorker() function can be used to terminate a
* dynamically registered background worker; the worker will be sent a SIGTERM
* and will not be restarted after it exits. Whenever the postmaster knows
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers