On Mon, Apr 25, 2016 at 02:09:27PM -0400, Tom Lane wrote:
> I'm looking at buildfarm member tern's recent failure:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08
> # Running: pg_ctl -D
> /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata
> -l
> /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.log
> promote
> server promoting
> #
> Timed out while waiting for promotion of standby at RewindTest.pm line 166.
>
> But look at what's in the standby's log:
>
> LOG: database system was interrupted; last known up at 2016-04-25 01:58:39
> UTC
> LOG: entering standby mode
> LOG: redo starts at 0/2000028
> LOG: consistent recovery state reached at 0/2000C00
> LOG: database system is ready to accept read only connections
> LOG: started streaming WAL from primary at 0/3000000 on timeline 1
> LOG: statement: SELECT NOT pg_is_in_recovery()
> ... 88 occurrences of that removed ...
> LOG: statement: SELECT NOT pg_is_in_recovery()
> LOG: received promote request
> FATAL: terminating walreceiver process due to administrator command
>
> The standby server didn't notice the promote request until AFTER
> the perl script had probed 90 times, once a second, for promotion
> to finish. What's up with that?
This resembles symptoms I studied last September.
> The other theory is that the startup process received the SIGUSR2
> but failed to act on it for a long time. It checks for that only
> in CheckForStandbyTrigger(), and the calls to that function are
> in assorted rather unprincipled places, which makes it hard to
> convince oneself that such a call will always happen soon.
Right. In every case, I caught the startup process taking too long in a
single XLOG_DBASE_CREATE replay. I don't suspect a bug; that is a costly
record to replay, especially given the machine's slow filesystem metadata
operations. I could reproduce such symptoms reliably, on any system, with the
attached demonstration patch. While we could hack things until the message
prints earlier, that won't make promotion finish earlier.
> We've seen previous examples of the same thing, eg
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09
> Noah raised the test's timeout from 30 seconds to 90 in response
> to that failure, but it looks to me like this was a mis-diagnosis
> of the problem.
I can understand you guessing the 2016-02-09 failure inspired that change, but
it was either a coincidence or merely reminded me to commit an already-written
change. I had performed burn-in runs, with timeouts raised to 900s, totalling
25177 executions of src/bin/pg_rewind/t/002_databases.pl. The longest promote
duration among those runs was 43s, so I doubled that quantity and rounded up.
For the 2016-04-25 failure, note this bit of 001_basic_master.log:
LOG: statement: CHECKPOINT
LOG: statement: SELECT pg_current_xlog_location() = write_location FROM
pg_stat_replication WHERE application_name = 'rewind_standby';
... 39 duplicate lines removed ...
LOG: statement: SELECT pg_current_xlog_location() = write_location FROM
pg_stat_replication WHERE application_name = 'rewind_standby';
LOG: received immediate shutdown request
That's 41s just to copy WAL. The slowest of those 002_databases.pl runs used
7s, and the worst I've seen in 2579 runs of 001_basic.pl is 22s. The failed
run likely coincided with exceptional system load.
If this happens again anytime soon, I should triple (or so) that 90s timeout.
diff --git a/src/backend/commands/dbcommands.c
b/src/backend/commands/dbcommands.c
index 6cbe65e..8135a64 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -44,6 +44,7 @@
#include "commands/defrem.h"
#include "commands/seclabel.h"
#include "commands/tablespace.h"
+#include "libpq/pqsignal.h"
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -2085,6 +2086,11 @@ dbase_redo(XLogReaderState *record)
* We don't need to copy subdirectories
*/
copydir(src_path, dst_path, false);
+
+ /* Pause to tickle timeouts sensitive to recovery duration. */
+ PG_SETMASK(&BlockSig);
+ pg_usleep(91 * 1000000L);
+ PG_SETMASK(&UnBlockSig);
}
else if (info == XLOG_DBASE_DROP)
{
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers