Re: PANIC during crash recovery of a recently promoted standby

Michael Paquier Wed, 27 Jun 2018 18:38:56 -0700

Adding Heikki and Andres in CC for awareness.

On Wed, Jun 27, 2018 at 05:29:38PM +0900, Michael Paquier wrote:
> I have spent a bit of time testing this on HEAD, 10 and 9.6.  For 9.5,
> 9.4 and 9.3 I have reproduced the failure and tested the patch, but I
> lacked time to perform more tests.  The patch set for 9.3~9.5 applies
> without conflict across the 3 branches.  9.6 has a conflict in a
> comment, and v10 had an extra comment conflict.
> 
> Feel free to have a look, I am not completely done with this stuff and
> I'll work more tomorrow on checking 9.3~9.5.


I have now been able to spend the time I wanted on this patch series,
including testing for 9.3 to 9.5.  Attached are a couple of patches
you can use to reproduce the failures for all the branches:
- For master and 10, the tests are included in the patch and are
proposed for commit.
- On 9.6, I had to tweak the TAP scripts, as pg_ctl start does not yet
use the wait mode by default there.
- On 9.5, there is a tweak to src/Makefile.global.in which cleans up
tmp_check, and a couple of incompatible GUCs had to be tweaked.
- On 9.4, I had to tweak src/Makefile.global.in so that the temporary
installation path is correct.  Again some GUCs had to be tweaked.
- On 9.3, there is no TAP infrastructure, so I tweaked
src/test/recovery/Makefile to be able to run the tests.

I have also created a bash script which emulates what the TAP test does,
which is attached.  Apparently for timing reasons, I have not been able
to reproduce the problem with it.  Anyway, the attached sets make it
possible to run (and in effect back-port) the TAP suite so that the
problematic test case can be run, and it shows the failure, so we can
use that.

Thoughts?  I would love more input about the patch concept.
--
Michael

#!/bin/bash

killall postgres
sleep 1

PSQL="psql -X"
INITDB="initdb"
DATA_PRIMARY=$HOME/data/primary
DATA_STANDBY=$HOME/data/standby
PORT_PRIMARY=5432
PORT_STANDBY=5433

rm -rf $DATA_PRIMARY $DATA_STANDBY

# Initialize a primary with one standby.
$INITDB -D $DATA_PRIMARY

cat >> $DATA_PRIMARY/postgresql.conf <<EOF
wal_log_hints = off
wal_level = replica
restart_after_crash = off
max_wal_senders = 5
hot_standby = on
fsync = off
port = $PORT_PRIMARY
logging_collector = on
log_directory = 'log'
max_wal_size = 128MB
shared_buffers = 1GB
max_connections = 10
log_min_messages = debug1
log_min_error_statement = debug1
log_checkpoints = on
EOF

cat >> $DATA_PRIMARY/pg_hba.conf <<EOF
host replication ioltas 127.0.0.1/32 trust
host replication ioltas ::1/128 trust
local replication ioltas trust
EOF

pg_ctl start -w -D $DATA_PRIMARY
createdb $USER

pg_basebackup -D $DATA_STANDBY -p $PORT_PRIMARY
cat >> $DATA_STANDBY/postgresql.conf <<EOF
port = $PORT_STANDBY
checkpoint_timeout = 1h
checkpoint_completion_target = 0.9
EOF
cat >> $DATA_STANDBY/recovery.conf <<EOF
standby_mode=on
primary_conninfo = 'port=$PORT_PRIMARY user=ioltas'
EOF
pg_ctl start -w -D $DATA_STANDBY

# Dummy table for the upcoming tests.
$PSQL -p $PORT_PRIMARY -c 'create table test1 (a int);'
$PSQL -p $PORT_PRIMARY -c 'insert into test1 select generate_series(1, 10000);'
$PSQL -p $PORT_PRIMARY -c 'checkpoint;'
# The following vacuum will set visibility map bits and create
# problematic WAL records.
$PSQL -p $PORT_PRIMARY -c 'vacuum verbose test1;'

# Wait for standby to replay.
sleep 5

# Now force a checkpoint on the standby. This seems unnecessary but for "some"
# reason, the previous checkpoint on the primary does not reflect on the standby
# and without an explicit checkpoint, it may start redo recovery from a much
# older point, which includes even create table and initial page additions.
$PSQL -p $PORT_STANDBY -c 'checkpoint;'

# Now just use a dummy table and run some operations to move minRecoveryPoint
# beyond the previous vacuum.
$PSQL -p $PORT_PRIMARY -c 'create table test2 (a int, b text);'
$PSQL -p $PORT_PRIMARY -c 'insert into test2 select generate_series(1,10000), md5(random()::text)'
$PSQL -p $PORT_PRIMARY -c 'truncate test2;'

# Wait again for replay on standby side.
sleep 5

# Promote standby and wait for it to finish.
pg_ctl promote -w -D $DATA_STANDBY
sleep 5

# Truncate the table on the promoted standby, vacuum and extend it
# again to create new page references.  The first post-recovery checkpoint
# has not happened yet.
$PSQL -p $PORT_STANDBY -c 'truncate test1;'
$PSQL -p $PORT_STANDBY -c 'vacuum verbose test1;'
$PSQL -p $PORT_STANDBY -c 'insert into test1 select generate_series(1,1000);'

pg_ctl -w --mode immediate -D $DATA_STANDBY stop
# Crash should happen here
pg_ctl -w -D $DATA_STANDBY start

# Wait for recovery to finish
sleep 5

$PSQL -p $PORT_STANDBY -c 'SELECT count(*) FROM test1'
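
Since the script above relies on fixed "sleep 5" calls, one possible
variation is to poll for the expected state instead.  The following is
only a minimal sketch: wait_until and standby_caught_up are hypothetical
helpers that are not part of the attached script, the LSN function name
assumes v10 or newer (older branches use the xlog-based names), and it
is not established that this would make the script reproduce the
failure.

```shell
#!/bin/bash

# Hypothetical helper: retry a command once per second until it
# succeeds or the timeout (in seconds) expires.
# Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1
    shift
    local i
    for ((i = 0; i < timeout; i++)); do
        if "$@"; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# Hypothetical check: succeeds once the standby has replayed WAL up to
# the LSN given as first argument.  Assumes v10+ function naming and
# the standby port used by the script (5433 if PORT_STANDBY is unset).
standby_caught_up() {
    local target_lsn=$1
    local caught_up
    caught_up=$(psql -X -At -p "${PORT_STANDBY:-5433}" -c \
        "SELECT pg_last_wal_replay_lsn() >= '$target_lsn'::pg_lsn") || return 1
    [ "$caught_up" = "t" ]
}

# Possible use in place of "sleep 5" after the writes on the primary:
#   lsn=$(psql -X -At -p "$PORT_PRIMARY" -c 'SELECT pg_current_wal_lsn()')
#   wait_until 30 standby_caught_up "$lsn" || exit 1
```

The retry loop is kept separate from the check so that the same helper
could also wait for other conditions, for example the end of recovery
after promotion.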

promote-panic-test-93.tar.gz
Description: application/gzip

promote-panic-test-94.tar.gz
Description: application/gzip

promote-panic-test-95.tar.gz
Description: application/gzip

promote-panic-test-96.tar.gz
Description: application/gzip

signature.asc
Description: PGP signature
