On Thu, Mar 24, 2022 at 6:04 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
> > Thanks, committed.
>
> Some of the buildfarm is seeing failures in the pg_checksums test.
Hmm. So the tests seem to be failing because 002_actions.pl stops the database cluster, runs pg_checksums (which passes), writes some zero bytes over the line pointer array of the first block of pg_class, and then runs pg_checksums again. In the failing buildfarm runs, pg_checksums fails to detect the corruption: the second run succeeds, while the test expects it to fail. (A rough sketch of that sequence is at the end of this email.) That's pretty curious, because if the database cluster is stopped, and things are OK at that point, then how could a server bug of any kind cause a Perl script to be unable to corrupt a file on disk?

A possible clue is that I also see a few machines failing in recoveryCheck. And the code that is failing there looks like this:

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
    my ($stdout, $stderr);
    $node_primary3->psql('postgres',
        "\\a\\t\nSELECT * FROM pg_stat_activity",
        stdout => \$stdout, stderr => \$stderr);
    diag $stdout, $stderr;
    $node_primary3->stop('fast');
    $node_standby3->stop('fast');
    die "could not determine walsender pid, can't continue";
}

And the failure looks like this:

#   Failed test 'have walsender pid 1047504
# 1047472'
#   at t/019_replslot_limit.pl line 343.

That sure looks like there are multiple walsender PIDs active, and the pg_stat_activity output confirms it: 1047504 is running START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1, and 1047472 is running START_REPLICATION SLOT "pg_basebackup_1047472" 0/600000 TIMELINE 1.

Both of these failures could possibly be explained by something failing to shut down properly, but it's not the same thing in each case. In the first case, the database server would have had to still be running after we run $node->stop, and it would have had to overwrite the bad contents of pg_class with some good contents. In the second case, the cluster is supposed to still be running, but the backends that were creating those replication slots should have exited sooner.

I've been running the pg_checksums test in a loop here for a bit now in the hopes of being able to reproduce the failure, but it doesn't seem to want to fail for me. And I've also looked over the commit, and I can't quite see how it would cause a process, or the cluster, to fail to shut down, unless perhaps it's the checkpointer that gets stuck, but that doesn't really seem to match the symptoms.

Any ideas?

--
Robert Haas
EDB: http://www.enterprisedb.com
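
P.S. For anyone who doesn't have the test file handy, here's roughly the sequence I'm describing above. This is a paraphrase, not the actual test code: the module and function names are the usual TAP-framework ones as best I recall, and the file offset and byte count are made up for illustration.

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Set up a cluster with checksums enabled.
my $node = PostgreSQL::Test::Cluster->new('main');
$node->init(extra => ['--data-checksums']);
$node->start;
my $pgdata = $node->data_dir;

# Find pg_class's file on disk, then stop the server.
my $relpath = $node->safe_psql('postgres',
    "SELECT pg_relation_filepath('pg_class')");
$node->stop;

# First pass: everything should verify cleanly.
command_ok([ 'pg_checksums', '--check', '-D', $pgdata ],
    'checksums pass before corruption');

# Zero out part of the line pointer array in block 0; the array starts
# right after the 24-byte page header (offset and length illustrative).
open my $fh, '+<', "$pgdata/$relpath" or die "open: $!";
binmode $fh;
sysseek($fh, 32, 0) or die "seek: $!";
syswrite($fh, "\0" x 8) or die "write: $!";
close $fh;

# Second pass: this is the step that is unexpectedly succeeding in the
# failing buildfarm runs.
command_fails([ 'pg_checksums', '--check', '-D', $pgdata ],
    'checksums fail after corruption');

done_testing();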