On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.

> We've seen variants
> on this theme on half a dozen machines just in the past week --- and it
> seems to mostly happen in 9.5 and HEAD, which is fishy.

It has been affecting only the four AIX animals, which do share hardware.
(Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)  I
agree the concentration on 9.5 and HEAD is suspicious; while those branches
get the most buildfarm runs, that factor by itself doesn't explain the
distribution of failures among versions.

> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting.  A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
> Any objections?  Anybody have another idea for data to collect?

That's reasonable.  If you would like higher-fidelity data, I can run loops of
"pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
that for HEAD and 9.2 simultaneously.  A day of logs from that should show
clearly if HEAD is systematically worse than 9.2.  By the way, you would
almost surely qualify for an account on this machine.

I had drafted the following message and patch last week, and I suppose it
belongs in this thread:

On Mon, Oct 12, 2015 at 06:41:06PM -0400, Tom Lane wrote:
> I'm not sure if this will completely fix our problems with "pg_ctl start"
> related buildfarm failures on very slow critters.  It does get rid of the
> hard wired 5-second timeout, but the 60-second timeout could still be an
> issue.  I think Noah was considering a patch to allow that number to be
> raised.  I'd be in favor of letting pg_ctl accept a default timeout length
> from an environment variable, and then the slower critters could be fixed
> by adjusting their buildfarm configurations.

Your commit 6bcce25 made src/bin test suites stop failing due to pg_ctl
startup timeouts, but other suites have been failing on the AIX buildfarm zoo
due to slow shutdown.  Example taking 72s to even reach ShutdownXLOG():

So, I wish to raise the timeout for those animals.  Using an environment
variable was a good idea; it's one less thing for test authors to remember.
Since the variable affects a performance-related fudge factor rather than
changing behavior per se, I'm less skittish than usual about unintended
consequences of dynamic scope.  (With said unintended consequences in mind, I
made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
the service created.)
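
To spell out the precedence the patch aims for, here is a minimal standalone
sketch.  It is not the patch itself: resolve_wait_seconds() and
register_embeds_timeout() are hypothetical helpers that do not exist in
pg_ctl.c, and DEFAULT_WAIT is assumed to be 60, matching the documented
default.  The built-in default applies first, PGCTLTIMEOUT overrides it, an
explicit -t overrides both, and only an explicit -t is eligible to be embedded
by "pg_ctl register".

```c
#include <stdbool.h>
#include <stdlib.h>

#define DEFAULT_WAIT 60			/* assumed to match pg_ctl's documented default */

/*
 * Hypothetical helper, not part of pg_ctl.c.  env_value stands in for
 * getenv("PGCTLTIMEOUT") and opt_value for the argument of -t; either may be
 * NULL when absent.  *from_arg reports whether -t supplied the final value.
 */
static int
resolve_wait_seconds(const char *env_value, const char *opt_value,
					 bool *from_arg)
{
	int			wait_seconds = DEFAULT_WAIT;

	*from_arg = false;

	/* The environment variable only replaces the built-in default. */
	if (env_value != NULL)
		wait_seconds = atoi(env_value);

	/* An explicit -t wins over both the default and the environment. */
	if (opt_value != NULL)
	{
		wait_seconds = atoi(opt_value);
		*from_arg = true;
	}

	return wait_seconds;
}

/*
 * Hypothetical helper mirroring the condition in pgwin32_CommandLine():
 * a timeout is written into the service command line only if it came from
 * -t and differs from the default.
 */
static bool
register_embeds_timeout(int wait_seconds, bool from_arg)
{
	return from_arg && wait_seconds != DEFAULT_WAIT;
}
```

With PGCTLTIMEOUT=900 and no -t, the sketch waits up to 900 seconds but would
not embed "-t 900" into a registered service; adding "-t 120" yields 120
seconds and does embed it.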

diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
index eaa0cc8..6ceb781 100644
--- a/doc/src/sgml/ref/pg_ctl-ref.sgml
+++ b/doc/src/sgml/ref/pg_ctl-ref.sgml
@@ -362,7 +362,9 @@ PostgreSQL documentation
         The maximum number of seconds to wait when waiting for startup or
-        shutdown to complete.  The default is 60 seconds.
+        shutdown to complete.  Defaults to the value of the
+        <envar>PGCTLTIMEOUT</> environment variable or, if not set, to 60
+        seconds.
@@ -487,6 +489,17 @@ PostgreSQL documentation
+    <term><envar>PGCTLTIMEOUT</envar></term>
+    <listitem>
+     <para>
+      Default limit on the number of seconds to wait when waiting for startup
+      or shutdown to complete.  If not set, the default is 60 seconds.
+     </para>
+    </listitem>
+   </varlistentry>
+   <varlistentry>
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 9da38c4..bae6c22 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -72,6 +72,7 @@ typedef enum
 static bool do_wait = false;
 static bool wait_set = false;
 static int     wait_seconds = DEFAULT_WAIT;
+static bool wait_seconds_arg = false;
 static bool silent_mode = false;
 static ShutdownMode shutdown_mode = FAST_MODE;
 static int     sig = SIGINT;           /* default */
@@ -1431,7 +1432,8 @@ pgwin32_CommandLine(bool registration)
        if (registration && do_wait)
                appendPQExpBuffer(cmdLine, " -w");
-       if (registration && wait_seconds != DEFAULT_WAIT)
+       /* Don't propagate a value from an environment variable. */
+       if (registration && wait_seconds_arg && wait_seconds != DEFAULT_WAIT)
                appendPQExpBuffer(cmdLine, " -t %d", wait_seconds);
        if (registration && silent_mode)
@@ -2128,6 +2130,7 @@ main(int argc, char **argv)
                {NULL, 0, NULL, 0}
+       char       *env_wait;
        int                     option_index;
        int                     c;
        pgpid_t         killproc = 0;
@@ -2178,6 +2181,10 @@ main(int argc, char **argv)
+       env_wait = getenv("PGCTLTIMEOUT");
+       if (env_wait != NULL)
+               wait_seconds = atoi(env_wait);
         * 'Action' can be before or after args so loop over both. Some
         * getopt_long() implementations will reorder argv[] to place all flags
@@ -2255,6 +2262,7 @@ main(int argc, char **argv)
                                case 't':
                                        wait_seconds = atoi(optarg);
+                                       wait_seconds_arg = true;
                                case 'U':
                                        if (strchr(optarg, '\\'))