Re: [HACKERS] Apparent race condition in standby promotion

Noah Misch Tue, 03 May 2016 23:11:48 -0700

On Mon, Apr 25, 2016 at 02:09:27PM -0400, Tom Lane wrote:
> I'm looking at buildfarm member tern's recent failure:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08


> # Running: pg_ctl -D 
> /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata
>  -l 
> /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.log
>  promote
> server promoting
> # 
> Timed out while waiting for promotion of standby at RewindTest.pm line 166.
> 
> But look at what's in the standby's log:
> 
> LOG:  database system was interrupted; last known up at 2016-04-25 01:58:39 
> UTC
> LOG:  entering standby mode
> LOG:  redo starts at 0/2000028
> LOG:  consistent recovery state reached at 0/2000C00
> LOG:  database system is ready to accept read only connections
> LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
> LOG:  statement: SELECT NOT pg_is_in_recovery()
> ... 88 occurrences of that removed ...
> LOG:  statement: SELECT NOT pg_is_in_recovery()
> LOG:  received promote request
> FATAL:  terminating walreceiver process due to administrator command
> 
> The standby server didn't notice the promote request until AFTER
> the perl script had probed 90 times, once a second, for promotion
> to finish.  What's up with that?

This resembles symptoms I studied last September.

> The other theory is that the startup process received the SIGUSR2
> but failed to act on it for a long time.  It checks for that only
> in CheckForStandbyTrigger(), and the calls to that function are
> in assorted rather unprincipled places, which makes it hard to
> convince oneself that such a call will always happen soon.

Right.  In every case, I caught the startup process taking too long in a
single XLOG_DBASE_CREATE replay.  I don't suspect a bug; that is a costly
record to replay, especially given the machine's slow filesystem metadata
operations.  I could reproduce such symptoms reliably, on any system, with the
attached demonstration patch.  While we could hack things until the message
prints earlier, that won't make promotion finish earlier.

> We've seen previous examples of the same thing, eg
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09
> Noah raised the test's timeout from 30 seconds to 90 in response
> to that failure, but it looks to me like this was a mis-diagnosis
> of the problem.

I can understand you guessing the 2016-02-09 failure inspired that change, but
it was either a coincidence or merely reminded me to commit an already-written
change.  I had performed burn-in runs, with timeouts raised to 900s, totalling
25177 executions of src/bin/pg_rewind/t/002_databases.pl.  The longest promote
duration among those runs was 43s, so I doubled that quantity and rounded up.

For the 2016-04-25 failure, note this bit of 001_basic_master.log:

LOG:  statement: CHECKPOINT
LOG:  statement: SELECT pg_current_xlog_location() = write_location FROM 
pg_stat_replication WHERE application_name = 'rewind_standby';
... 39 duplicate lines removed ...
LOG:  statement: SELECT pg_current_xlog_location() = write_location FROM 
pg_stat_replication WHERE application_name = 'rewind_standby';
LOG:  received immediate shutdown request

That's 41s just to copy WAL.  The slowest of those 002_databases.pl runs used
7s, and the worst I've seen in 2579 runs of 001_basic.pl is 22s.  The failed
run likely coincided with exceptional system load.

If this happens again anytime soon, I should triple (or so) that 90s timeout.

diff --git a/src/backend/commands/dbcommands.c 
b/src/backend/commands/dbcommands.c
index 6cbe65e..8135a64 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -44,6 +44,7 @@
 #include "commands/defrem.h"
 #include "commands/seclabel.h"
 #include "commands/tablespace.h"
+#include "libpq/pqsignal.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -2085,6 +2086,11 @@ dbase_redo(XLogReaderState *record)
                 * We don't need to copy subdirectories
                 */
                copydir(src_path, dst_path, false);
+
+               /* Pause to tickle timeouts sensitive to recovery duration. */
+               PG_SETMASK(&BlockSig);
+               pg_usleep(91 * 1000000L);
+               PG_SETMASK(&UnBlockSig);
        }
        else if (info == XLOG_DBASE_DROP)
        {

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Apparent race condition in standby promotion

Reply via email to