On Thu, Mar 24, 2022 at 6:04 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
> > Thanks, committed.
>
> Some of the buildfarm is seeing failures in the pg_checksums test.
Hmm. So the tests seem to be failing because 002_actions.pl stops the database cluster, runs pg_checksums (which passes), writes some zero bytes over the line pointer array of the first block of pg_class, and then runs pg_checksums again. In the failing buildfarm runs, pg_checksums fails to detect the corruption: the second run succeeds, while the test expects it to fail. (A rough sketch of that sequence is at the end of this email.) That's pretty curious, because if the database cluster is stopped, and things are OK at that point, then how could a server bug of any kind cause a Perl script to be unable to corrupt a file on disk?

A possible clue is that I also see a few machines failing in recoveryCheck. And the code that is failing there looks like this:

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
    my ($stdout, $stderr);
    $node_primary3->psql('postgres',
        "\\a\\t\nSELECT * FROM pg_stat_activity",
        stdout => \$stdout, stderr => \$stderr);
    diag $stdout, $stderr;
    $node_primary3->stop('fast');
    $node_standby3->stop('fast');
    die "could not determine walsender pid, can't continue";
}

And the failure looks like this:

#   Failed test 'have walsender pid 1047504
# 1047472'
#   at t/019_replslot_limit.pl line 343.

That sure looks like there are multiple walsender PIDs active, and the pg_stat_activity output confirms it: 1047504 is running START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1, and 1047472 is running START_REPLICATION SLOT "pg_basebackup_1047472" 0/600000 TIMELINE 1.

Both of these failures could possibly be explained by something failing to shut down properly, but it's not the same thing in each case. In the first case, the database server would have had to still be running after we run $node->stop, and it would have had to overwrite the bad contents of pg_class with some good contents. In the second case, the cluster is supposed to still be running, but the backends that were creating those replication slots should have exited sooner.

I've been running the pg_checksums test in a loop here for a bit now in the hopes of being able to reproduce the failure, but it doesn't seem to want to fail for me. And I've also looked over the commit, and I can't quite see how it would cause a process, or the cluster, to fail to shut down, unless perhaps it's the checkpointer that gets stuck, but that doesn't really seem to match the symptoms.

Any ideas?

--
Robert Haas
EDB: http://www.enterprisedb.com
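
P.S. For anyone who doesn't have the test file handy, here's roughly the sequence I'm describing above. This is a paraphrase, not the actual test code: the module and function names are the usual TAP-framework ones as best I recall, and the file offset and byte count are made up for illustration.

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Set up a cluster with checksums enabled.
my $node = PostgreSQL::Test::Cluster->new('main');
$node->init(extra => ['--data-checksums']);
$node->start;
my $pgdata = $node->data_dir;

# Find pg_class's file on disk, then stop the server.
my $relpath = $node->safe_psql('postgres',
    "SELECT pg_relation_filepath('pg_class')");
$node->stop;

# First pass: everything should verify cleanly.
command_ok([ 'pg_checksums', '--check', '-D', $pgdata ],
    'checksums pass before corruption');

# Zero out part of the line pointer array in block 0; the array starts
# right after the 24-byte page header (offset and length illustrative).
open my $fh, '+<', "$pgdata/$relpath" or die "open: $!";
binmode $fh;
sysseek($fh, 32, 0) or die "seek: $!";
syswrite($fh, "\0" x 8) or die "write: $!";
close $fh;

# Second pass: this is the step that is unexpectedly succeeding in the
# failing buildfarm runs.
command_fails([ 'pg_checksums', '--check', '-D', $pgdata ],
    'checksums fail after corruption');

done_testing();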