Re: [HACKERS] pg_rewind tap test unstable

Michael Paquier Mon, 03 Aug 2015 22:22:15 -0700

On Mon, Aug 3, 2015 at 5:35 PM, Christoph Berg <[email protected]> wrote:
> Re: Michael Paquier 2015-07-28 <cab7npqqcpgy3u7cmfo8sqquobsfmeieohueslxwycc64j3g...@mail.gmail.com>
>> On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <[email protected]> wrote:
>> > for something between 10% and 20% of the devel builds for
>> > apt.postgresql.org (which happen every 6h if there's a git change,
>> > so it happens every few days), I'm seeing this:
>> > Dubious, test returned 1 (wstat 256, 0x100)
>> > Failed 1/8 subtests
>> >
>> > I don't have the older logs available, but from memory, the subtest
>> > failing and the two numbers mentioned are always the same.
>>
>> There should be some output logs in src/bin/pg_rewind/tmp_check/log/*?
>> Could you attach them here if you have them? That would be helpful to
>> understand what is happening.
>
> It took me a few attempts to tell the build environment to save a copy
> on failure and not shred everything right away. So here we go:


In test case 2, the failure is that the standby did not have time to
replicate the database created on the master before it was promoted.
One possible explanation is that the standby was promoted before all
the WAL needed for the tests had been replayed, hence we had better be
sure that the replay_location of the standby matches
pg_current_xlog_location() before promotion. On the buildfarm, hamster,
the legendary slowest machine of the whole set, does not complain about
that. Is your environment that heavily loaded when you run the tests?
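
To make the idea concrete, here is a rough, standalone sketch of the
"wait until caught up" logic, written in Python rather than the Perl of
the test suite. The helper names (parse_lsn, standby_caught_up,
poll_until) are made up for illustration and are not part of
RewindTest.pm. The sketch compares WAL locations numerically and uses
">=" instead of strict equality, which is an assumption on my side, not
what the patch does.

```python
import time
from typing import Callable


def parse_lsn(text: str) -> int:
    """Convert an 'X/Y' hexadecimal WAL location into a single integer."""
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def standby_caught_up(master_lsn: str, replay_lsn: str) -> bool:
    """True once the standby has replayed at least up to master_lsn.

    Assumption: '>=' is used here, whereas the patch tests for equality.
    """
    return parse_lsn(replay_lsn) >= parse_lsn(master_lsn)


def poll_until(check: Callable[[], bool], attempts: int = 30,
               delay: float = 1.0) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the standby time to replay more WAL
    return False


if __name__ == "__main__":
    # Simulated replay positions reported by a standby over three polls.
    master = "0/3000060"
    replayed = iter(["0/2FFFFF8", "0/3000028", "0/3000060"])
    ok = poll_until(lambda: standby_caught_up(master, next(replayed)),
                    attempts=5, delay=0.0)
    print("caught up" if ok else "timed out")
```

The locations are parsed before comparing because a plain string
comparison would order "0/A000000" after "0/10000000", which is wrong
numerically.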

Perhaps the attached patch helps?
-- 
Michael

diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index b66ff0d..fce725f 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -232,6 +232,13 @@ sub promote_standby
 	#### Now run the test-specific parts to run after standby has been started
 	# up standby
 
+	# Wait until the standby has caught up with the primary, by polling
+	# pg_stat_replication.
+	my $caughtup_query =
+"SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+	poll_query_until($caughtup_query, $connstr_master)
+	  or die "Timed out while waiting for standby to catch up";
+
 	# Now promote slave and insert some new data on master, this will put
 	# the master out-of-sync with the standby. Wait until the standby is
 	# out of recovery mode, and is ready to accept read-write connections.

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
