Adding Heikki and Andres in CC here for awareness.

On Wed, Jun 27, 2018 at 05:29:38PM +0900, Michael Paquier wrote:
> I have spent a bit of time testing this on HEAD, 10 and 9.6.  For 9.5,
> 9.4 and 9.3 I have reproduced the failure and tested the patch, but I
> lacked time to perform more tests.  The patch set for 9.3~9.5 applies
> without conflict across the 3 branches.  9.6 has a conflict in a
> comment, and v10 had an extra comment conflict.
>
> Feel free to have a look, I am not completely done with this stuff and
> I'll work more tomorrow on checking 9.3~9.5.

And I have been able to spend the time I wanted on this patch series,
testing 9.3 to 9.5.  Attached are a couple of patches you can use to
reproduce the failures on all the branches:
- For master and 10, the tests are included in the patch and are
  proposed for commit.
- On 9.6, I had to tweak the TAP scripts, as pg_ctl start only waits by
  default from 10 onwards.
- On 9.5, there is a tweak to src/Makefile.global.in which cleans up
  tmp_check, and a couple of GUCs are not compatible.
- On 9.4, I had to tweak src/Makefile.global.in so that the temporary
  installation path is correct.  Again some GUCs had to be tweaked.
- On 9.3, there is no TAP infrastructure, so I tweaked
  src/test/recovery/Makefile to be able to run the tests.

I have also created a bash script which emulates what the TAP test does,
which is attached.  Apparently for timing reasons, I have not been able
to reproduce the problem with it.  Anyway, running (and actually sort of
back-porting) the TAP suite so that the problematic test case can be run
is possible with the sets attached, and it shows the failure, so we can
use that.

Thoughts?  I would love more input about the patch concept.
--
Michael

#!/bin/bash

killall postgres
sleep 1

PSQL="psql -X"
INITDB="initdb"
DATA_PRIMARY=$HOME/data/primary
DATA_STANDBY=$HOME/data/standby
PORT_PRIMARY=5432
PORT_STANDBY=5433

rm -rf $DATA_PRIMARY $DATA_STANDBY

# Initialize a primary with one standby.
$INITDB -D $DATA_PRIMARY
cat >> $DATA_PRIMARY/postgresql.conf <<EOF
wal_log_hints = off
wal_level = replica
restart_after_crash = off
max_wal_senders = 5
hot_standby = on
fsync = off
port = $PORT_PRIMARY
logging_collector = on
log_directory = 'log'
max_wal_size = 128MB
shared_buffers = 1GB
max_connections = 10
log_min_messages = debug1
log_min_error_statement = debug1
log_checkpoints = on
EOF

cat >> $DATA_PRIMARY/pg_hba.conf <<EOF
host replication ioltas 127.0.0.1/32 trust
host replication ioltas ::1/128 trust
local replication ioltas trust
EOF

pg_ctl start -w -D $DATA_PRIMARY
createdb $USER

pg_basebackup -D $DATA_STANDBY -p $PORT_PRIMARY
cat >> $DATA_STANDBY/postgresql.conf <<EOF
port = $PORT_STANDBY
checkpoint_timeout = 1h
checkpoint_completion_target = 0.9
EOF

cat >> $DATA_STANDBY/recovery.conf <<EOF
standby_mode=on
primary_conninfo = 'port=5432 user=ioltas'
EOF

pg_ctl start -w -D $DATA_STANDBY

# Dummy table for the upcoming tests.
$PSQL -p $PORT_PRIMARY -c 'create table test1 (a int);'
$PSQL -p $PORT_PRIMARY -c 'insert into test1 select generate_series(1, 10000);'
$PSQL -p $PORT_PRIMARY -c 'checkpoint;'

# The following vacuum will set visibility map bits and create
# problematic WAL records.
$PSQL -p $PORT_PRIMARY -c 'vacuum verbose test1;'

# Wait for standby to replay.
sleep 5

# Now force a checkpoint on the standby. This seems unnecessary but for "some"
# reason, the previous checkpoint on the primary does not reflect on the standby
# and without an explicit checkpoint, it may start redo recovery from a much
# older point, which includes even create table and initial page additions.
$PSQL -p $PORT_STANDBY -c 'checkpoint;'

# Now just use a dummy table and run some operations to move minRecoveryPoint
# beyond the previous vacuum.
$PSQL -p $PORT_PRIMARY -c 'create table test2 (a int, b text);'
$PSQL -p $PORT_PRIMARY -c 'insert into test2 select generate_series(1,10000), md5(random()::text)'
$PSQL -p $PORT_PRIMARY -c 'truncate test2;'

# Wait again for replay on standby side.
sleep 5

# Promote standby and wait for it to finish.
pg_ctl promote -w -D $DATA_STANDBY
sleep 5

# Truncate the table on the promoted standby, vacuum and extend it
# again to create new page references. The first post-recovery checkpoint
# has not happened yet.
$PSQL -p $PORT_STANDBY -c 'truncate test1;'
$PSQL -p $PORT_STANDBY -c 'vacuum verbose test1;'
$PSQL -p $PORT_STANDBY -c 'insert into test1 select generate_series(1,1000);'

pg_ctl -w --mode immediate -D $DATA_STANDBY stop

# Crash should happen here
pg_ctl -w -D $DATA_STANDBY start

# Wait for recovery to finish
sleep 5

$PSQL -p $PORT_STANDBY -c 'SELECT count(*) FROM test1'

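The script above waits for replay with fixed "sleep 5" calls, so it cannot tell whether the standby has actually caught up. One possible way to tighten that, as a minimal sketch only, is to poll until a query returns the expected value. The helper name poll_until and the POLL_INTERVAL variable are hypothetical and not part of the attached script. The commented usage assumes the v10 function names pg_current_wal_lsn() and pg_last_wal_replay_lsn(); on the 9.x branches the equivalents are pg_current_xlog_location() and pg_last_xlog_replay_location().

```shell
#!/bin/bash

# Hypothetical helper, not part of the original script: run a command
# repeatedly until its output equals the expected string, giving up
# after the given number of attempts.
#   $1      expected output
#   $2      maximum number of attempts
#   $3...   command to run, with its arguments
# POLL_INTERVAL (assumed name) is the delay between attempts, 1s by default.
poll_until() {
    local expected="$1"
    local attempts="$2"
    local i=0
    local out
    shift 2

    while [ "$i" -lt "$attempts" ]; do
        # A failing command counts as "not there yet", not as an error.
        out=$("$@" 2>/dev/null) || out=
        if [ "$out" = "$expected" ]; then
            return 0
        fi
        sleep "${POLL_INTERVAL:-1}"
        i=$((i + 1))
    done
    return 1
}

# Possible usage in place of "sleep 5" (assumes v10 function names, and
# the pg_lsn type, which needs 9.4 or later):
#
#   lsn=$($PSQL -p $PORT_PRIMARY -At -c 'SELECT pg_current_wal_lsn()')
#   poll_until t 30 $PSQL -p $PORT_STANDBY -At \
#       -c "SELECT pg_last_wal_replay_lsn() >= '$lsn'::pg_lsn" \
#       || { echo "standby did not catch up" >&2; exit 1; }
```

This only removes the guesswork from the waits. It does not control where minRecoveryPoint ends up relative to the vacuum records, so it may well not be enough to make the script reproduce the failure.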
promote-panic-test-93.tar.gz
Description: application/gzip
promote-panic-test-94.tar.gz
Description: application/gzip
promote-panic-test-95.tar.gz
Description: application/gzip
promote-panic-test-96.tar.gz
Description: application/gzip