Hi,

On 2022-03-24 15:05:40 -0400, Tom Lane wrote:
> I just noticed something very interesting: in a couple of recent
> buildfarm runs with this failure, the pg_stat_activity printout
> no longer shows the extra walsender:
Oh. That is interesting. Thanks for catching that.


> What this suggests to me is that maybe the extra walsender is
> indeed not blocked on anything, but is just taking its time
> about exiting.  In these two runs, as well as in all the
> non-failing runs, it had enough time to do so.

Still odd that it started with the bugfix I committed. And the locking
pattern in [1] still seems suspicious. But the above does seem to suggest
those might just have been red herrings.

I had previously tried to make process exit slow by inserting delays at
various points, and only succeeded in making tests fail due to too many
connections. For some reason I didn't increase max_connections...

I now increased the test's max_connections and managed to get a somewhat
similar failure by sticking

  elog(LOG, "received disconnect");
  pg_usleep(220000);

into the 'X' case in PostgresMain(). However, I always see three pids or
the expected one pid, not two as in the buildfarm cases. But it's not too
surprising that such an artificial, uniform slowdown would have different
symptoms than reality.


> I suggest that we add a couple-of-seconds sleep in front of
> the query that collects walsender PIDs, and maybe a couple more
> seconds before the pg_stat_activity probe in the failure path,
> and see if the behavior changes at all.

Makes sense. I was previously thinking that it'd make sense to sleep for a
bit before the ->stop('fast'); calls, so that we can see whether it's just
the shutdown unblocking things (which [1] seems to suggest).

How about the attached variation, which retries (for 15s, with 100ms
sleeps) if there are multiple walsenders, printing the debugging info each
time? It'll still fail the test even if later iterations find just one
walsender, which seems like the right behaviour for now.


> That should be enough to confirm or disprove this idea pretty quickly.
> If it is right, a permanent fix could be to wait for the basebackup's
> walsender to disappear from node_primary3's pg_stat_activity before we
> start the one for node_standby3.

For some tests a "confirmed disconnect" mode would be useful: basically the
client waiting until it receives EOF after sending 'X'.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220226072325.wtvkwvvga2wc3nkn%40alap3.anarazel.de
diff --git i/src/test/recovery/t/019_replslot_limit.pl w/src/test/recovery/t/019_replslot_limit.pl
index f62b7b32f66..77bb401bc5f 100644
--- i/src/test/recovery/t/019_replslot_limit.pl
+++ w/src/test/recovery/t/019_replslot_limit.pl
@@ -333,23 +333,41 @@ $node_standby3->init_from_backup($node_primary3, $backup_name,
 $node_standby3->append_conf('postgresql.conf', "primary_slot_name = 'rep3'");
 $node_standby3->start;
 $node_primary3->wait_for_catchup($node_standby3);
 
-my $senderpid = $node_primary3->safe_psql('postgres',
-	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
-# We've seen occasional cases where multiple walsender pids are active. An
-# immediate shutdown may hide evidence of a locking bug. So if multiple
-# walsenders are observed, shut down in fast mode, and collect some more
-# information.
-if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
+my $senderpid;
+
+# We've seen occasional cases where multiple walsender pids are active. It
+# could be that we're just observing process shutdown being slow. To collect
+# more information, retry a couple times, print a bit of debugging information
+# each iteration. For now report a test failure even if later iterations
+# succeed.
+my $i = 0;
+while (1)
 {
 	my ($stdout, $stderr);
+
+	$senderpid = $node_primary3->safe_psql('postgres',
+		"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
+
+	last if like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+
+	# show information about all active connections
 	$node_primary3->psql('postgres',
 						 "\\a\\t\nSELECT * FROM pg_stat_activity",
 						 stdout => \$stdout, stderr => \$stderr);
 	diag $stdout, $stderr;
-	$node_primary3->stop('fast');
-	$node_standby3->stop('fast');
-	die "could not determine walsender pid, can't continue";
+
+	# unlikely that the problem would resolve after 15s, so give up at this point
+	if ($i++ == 150)
+	{
+		# An immediate shutdown may hide evidence of a locking bug. If
+		# retrying didn't resolve the issue, shut down in fast mode.
+		$node_primary3->stop('fast');
+		$node_standby3->stop('fast');
+		die "could not determine walsender pid, can't continue";
+	}
+
+	usleep(100_000);
 }
 
 my $receiverpid = $node_standby3->safe_psql('postgres',
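
As a rough illustration of the permanent fix Tom describes above, the test
could simply wait for the basebackup's walsender to vanish before starting
the standby. A minimal sketch, not part of the attached patch, assuming the
test's existing $node_primary3/$node_standby3 node objects and their
poll_query_until() helper:

	# Hypothetical: wait for the walsender left over from the base backup
	# to exit, so only the standby's walsender can appear later.
	$node_primary3->poll_query_until('postgres',
		"SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'walsender'")
	  or die "timed out waiting for walsenders to exit";

	$node_standby3->start;

poll_query_until() retries the query until it returns 't' (or the TAP
timeout expires), so this would replace the retry-and-report loop above
with an explicit wait for the race to resolve.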