On Tue, Aug 04, 2015 at 02:21:16PM +0900, Michael Paquier wrote:
> >> On Tue, Jul 28, 2015 at 5:57 PM, Christoph Berg <m...@debian.org> wrote:
> >> > for something between 10% and 20% of the devel builds for
> >> > apt.postgresql.org
> >> > (which happen every 6h if there's a git change, so it happens every few
> >> > days),
> >> > I'm seeing this:
> In test case 2, the failure happens to be that the standby did not
> have the time to replicate the database before promotion that has been
> created on the master. One possible explanation for this failure is
> that the standby has been promoted before all the WAL needed for the
> tests has been replayed, hence we had better be sure that the
> replay_location of the standby matches pg_current_xlog_location()
> before promotion.
> Perhaps the attached patch helps?

Thanks. In light of your diagnosis, I can reliably reproduce the failure by
injecting a sleep into XLogSendPhysical(). Your patch fixes the problem, but
it adds wal_receiver_status_interval (= 10s) stalls, doubling
src/bin/pg_rewind/t/001_basic.pl runtime on a fast system. (The standby
applies the final WAL quickly, then sleeps for wal_receiver_status_interval
before notifying the master.) The standby will apply any written, unapplied
WAL during promotion. Therefore, I plan to commit the attached
performance-neutral variant of your patch.
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 22e5cae..a4c1737 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -222,12 +222,8 @@ recovery_target_timeline='latest'
         '-l', "$log_path/standby.log",
         '-o', "-p $port_standby", 'start');
 
-    # Wait until the standby has caught up with the primary, by polling
-    # pg_stat_replication.
-    my $caughtup_query =
-"SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
-    poll_query_until($caughtup_query, $connstr_master)
-      or die "Timed out while waiting for standby to catch up";
+    # The standby may have WAL to apply before it matches the primary. That
+    # is fine, because no test examines the standby before promotion.
 }
 
 sub promote_standby
@@ -235,6 +231,12 @@ sub promote_standby
     #### Now run the test-specific parts to run after standby has been started
     # up standby
 
+    # Wait for the standby to receive and write all WAL.
+    my $wal_received_query =
+"SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+    poll_query_until($wal_received_query, $connstr_master)
+      or die "Timed out while waiting for standby to receive and write WAL";
+
     # Now promote slave and insert some new data on master, this will put
     # the master out-of-sync with the standby. Wait until the standby is
     # out of recovery mode, and is ready to accept read-write connections.
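
For anyone who wants to play with the wait-before-promote pattern outside the Perl test suite, here is a rough sketch in Python. It is not the poll_query_until() from RewindTest.pm. The names poll_until and FakeStandby, the default timeout and interval, and the toy model in which the master-visible write position advances by a fixed chunk per poll are all assumptions made for illustration. In a real test, the check would run the write_location query from the patch against the master.

```python
import time


def poll_until(check, timeout=90.0, interval=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or `timeout` seconds have passed.

    Returns True on success and False on timeout. check() is always
    called at least once, even when timeout is 0.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakeStandby:
    """Hypothetical stand-in for a standby as seen from the master."""

    def __init__(self, master_lsn, chunk):
        self.master_lsn = master_lsn  # toy pg_current_xlog_location()
        self.write_lsn = 0            # toy write_location
        self.chunk = chunk            # WAL "received" per poll (assumed)

    def wal_received(self):
        # Stand-in for:
        #   SELECT pg_current_xlog_location() = write_location
        #   FROM pg_stat_replication WHERE application_name = ...
        self.write_lsn = min(self.master_lsn, self.write_lsn + self.chunk)
        return self.write_lsn == self.master_lsn


if __name__ == "__main__":
    standby = FakeStandby(master_lsn=300, chunk=100)
    if not poll_until(standby.wal_received, timeout=5, interval=0.01):
        raise SystemExit(
            "Timed out while waiting for standby to receive and write WAL")
    print("standby has written all WAL; safe to promote")
```

The point of the sketch is only the shape of the wait: block until the written position matches the master's current position, and leave replay of any written but unapplied WAL to promotion itself.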