Re: [HACKERS] pg_rewind tap test unstable

Noah Misch Sun, 06 Sep 2015 21:17:56 -0700

On Tue, Aug 04, 2015 at 02:21:16PM +0900, Michael Paquier wrote:
> >> On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <[email protected]> wrote:
> >> > for something between 10% and 20% of the devel builds for apt.postgresql.org
> >> > (which happen every 6h if there's a git change, so it happens every few days),
> >> > I'm seeing this:


> In test case 2, the failure happens to be that the standby did not
> have the time to replicate the database before promotion that has been
> created on the master. One possible explanation for this failure is
> that the standby has been promoted before all the WAL needed for the
> tests has been replayed, hence we had better be sure that the
> replay_location of the standby matches pg_current_xlog_location()
> before promotion.

> Perhaps the attached patch helps?

Thanks.  In light of your diagnosis, I can reliably reproduce the failure by
injecting a sleep into XLogSendPhysical().  Your patch fixes the problem, but
it adds wal_receiver_status_interval (= 10s) stalls, doubling
src/bin/pg_rewind/t/001_basic.pl runtime on a fast system.  (The standby
applies the final WAL quickly, then sleeps for wal_receiver_status_interval
before notifying the master.)  The standby will apply any written, unapplied
WAL during promotion.  Therefore, I plan to commit the attached
performance-neutral variant of your patch.
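
To illustrate the reasoning above, here is a toy Python model.  It is not
part of the patch and not PostgreSQL code.  It assumes, as described above,
that the primary learns the standby's write position as soon as WAL is
written, but learns the replay position only with the next periodic status
message.  ToyStandby, seconds_until and STATUS_INTERVAL are hypothetical
names, and time is simulated in whole seconds.

```python
from dataclasses import dataclass

STATUS_INTERVAL = 10  # seconds; stands in for wal_receiver_status_interval


@dataclass
class ToyStandby:
    """Toy model of what the primary sees in pg_stat_replication."""

    reported_write: int = 0   # models write_location
    reported_replay: int = 0  # models replay_location
    actual_replay: int = 0    # what the standby has really applied

    def receive(self, lsn: int) -> None:
        # Assumption: the write position is reported as soon as WAL is written.
        self.reported_write = lsn
        # Replay finishes quickly, but the primary is not told yet.
        self.actual_replay = lsn

    def status_tick(self) -> None:
        # Periodic status message: only now does the primary learn the
        # replay position.
        self.reported_replay = self.actual_replay


def seconds_until(condition, standby, limit=60):
    """Poll once per simulated second; return the wait, or None on timeout."""
    for elapsed in range(limit + 1):
        if condition():
            return elapsed
        # A status message goes out at the end of every STATUS_INTERVAL seconds.
        if (elapsed + 1) % STATUS_INTERVAL == 0:
            standby.status_tick()
    return None


primary_lsn = 100
standby = ToyStandby()
standby.receive(primary_lsn)

# Polling on the write position succeeds at once in this model.
wait_write = seconds_until(lambda: standby.reported_write == primary_lsn, standby)
# Polling on the replay position has to wait for the next status message.
wait_replay = seconds_until(lambda: standby.reported_replay == primary_lsn, standby)
```

In this model wait_write is 0 and wait_replay is 10, which is the stall that
polling on write_location avoids.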

diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 22e5cae..a4c1737 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -222,12 +222,8 @@ recovery_target_timeline='latest'
                                   '-l', "$log_path/standby.log",
                                   '-o', "-p $port_standby", 'start');
 
-       # Wait until the standby has caught up with the primary, by polling
-       # pg_stat_replication.
-       my $caughtup_query =
-"SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
-       poll_query_until($caughtup_query, $connstr_master)
-         or die "Timed out while waiting for standby to catch up";
+       # The standby may have WAL to apply before it matches the primary.  That
+       # is fine, because no test examines the standby before promotion.
 }
 
 sub promote_standby
@@ -235,6 +231,12 @@ sub promote_standby
        #### Now run the test-specific parts to run after standby has been started
        # up standby
 
+       # Wait for the standby to receive and write all WAL.
+       my $wal_received_query =
+"SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+       poll_query_until($wal_received_query, $connstr_master)
+         or die "Timed out while waiting for standby to receive and write WAL";
+
        # Now promote slave and insert some new data on master, this will put
        # the master out-of-sync with the standby. Wait until the standby is
        # out of recovery mode, and is ready to accept read-write connections.

