Hi,

On 2022-03-24 15:05:40 -0400, Tom Lane wrote:
> I just noticed something very interesting: in a couple of recent
> buildfarm runs with this failure, the pg_stat_activity printout
> no longer shows the extra walsender:
Oh. That is interesting. Thanks for catching that.


> What this suggests to me is that maybe the extra walsender is
> indeed not blocked on anything, but is just taking its time
> about exiting.  In these two runs, as well as in all the
> non-failing runs, it had enough time to do so.

Still odd that it started with the bugfix I committed. And the locking
pattern in [1] still seems suspicious. But the above does seem to suggest
those might just have been red herrings.

I had previously tried to make process exit slow by inserting delays at
various points, and only succeeded in making tests fail due to too many
connections. For some reason I didn't increase max_connections...

I now increased the test's max_connections and managed to get a somewhat
similar failure by sticking

  elog(LOG, "received disconnect");
  pg_usleep(220000);

into the 'X' case in PostgresMain(). However, I always see three pids or
the expected one pid, not two as in the buildfarm cases. But it's not too
surprising that such an artificial, uniform slowdown would have different
symptoms than reality.


> I suggest that we add a couple-of-seconds sleep in front of
> the query that collects walsender PIDs, and maybe a couple more
> seconds before the pg_stat_activity probe in the failure path,
> and see if the behavior changes at all.

Makes sense. I was previously thinking that it'd make sense to sleep for a
bit before the ->stop('fast'); calls, so that we can see whether it's just
the shutdown unblocking things (which [1] seems to suggest).

How about the attached variation, which retries (for 15s, with 100ms
sleeps) if there are multiple walsenders, printing the debugging info each
time? It'll still fail the test even if later iterations find just one
walsender, which seems like the right behaviour for now.


> That should be enough to confirm or disprove this idea pretty quickly.
> If it is right, a permanent fix could be to wait for the basebackup's
> walsender to disappear from node_primary3's pg_stat_activity before we
> start the one for node_standby3.

For some tests a "confirmed disconnect" mode would be useful: basically the
client waiting until it receives EOF after sending 'X'.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220226072325.wtvkwvvga2wc3nkn%40alap3.anarazel.de
diff --git i/src/test/recovery/t/019_replslot_limit.pl w/src/test/recovery/t/019_replslot_limit.pl
index f62b7b32f66..77bb401bc5f 100644
--- i/src/test/recovery/t/019_replslot_limit.pl
+++ w/src/test/recovery/t/019_replslot_limit.pl
@@ -333,23 +333,41 @@ $node_standby3->init_from_backup($node_primary3, $backup_name,
 $node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
 $node_standby3->start;
 $node_primary3->wait_for_catchup($node_standby3);
 
-my $senderpid = $node_primary3->safe_psql('postgres',
-	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
-# We've seen occasional cases where multiple walsender pids are active. An
-# immediate shutdown may hide evidence of a locking bug. So if multiple
-# walsenders are observed, shut down in fast mode, and collect some more
-# information.
-if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
+my $senderpid;
+
+# We've seen occasional cases where multiple walsender pids are active. It
+# could be that we're just observing process shutdown being slow. To collect
+# more information, retry a couple times, print a bit of debugging information
+# each iteration. For now report a test failure even if later iterations
+# succeed.
+my $i = 0;
+while (1)
 {
 	my ($stdout, $stderr);
+
+	$senderpid = $node_primary3->safe_psql('postgres',
+		"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
+
+	last if like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+
+	# show information about all active connections
 	$node_primary3->psql('postgres',
 						 "\\a\\t\nSELECT * FROM pg_stat_activity",
 						 stdout => \$stdout, stderr => \$stderr);
 	diag $stdout, $stderr;
-	$node_primary3->stop('fast');
-	$node_standby3->stop('fast');
-	die "could not determine walsender pid, can't continue";
+
+	# unlikely that the problem would resolve after 15s, so give up at this point
+	if ($i++ == 150)
+	{
+		# An immediate shutdown may hide evidence of a locking bug. If
+		# retrying didn't resolve the issue, shut down in fast mode.
+		$node_primary3->stop('fast');
+		$node_standby3->stop('fast');
+		die "could not determine walsender pid, can't continue";
+	}
+
+	usleep(100_000);
 }
 
 my $receiverpid = $node_standby3->safe_psql('postgres',
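
As a rough illustration of the permanent fix Tom describes above, the test
could simply wait for the basebackup's walsender to vanish before starting
the standby. A minimal sketch, not part of the attached patch, assuming the
test's existing $node_primary3/$node_standby3 node objects and their
poll_query_until() helper:

	# Hypothetical: wait for the walsender left over from the base backup
	# to exit, so only the standby's walsender can appear later.
	$node_primary3->poll_query_until('postgres',
		"SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'walsender'")
	  or die "timed out waiting for walsenders to exit";

	$node_standby3->start;

poll_query_until() retries the query until it returns 't' (or the TAP
timeout expires), so this would replace the retry-and-report loop above
with an explicit wait for the race to resolve.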