Hi, I spent a bit more time fixing the TAP test. The attached patch makes it "work" for me (or I think it should, in principle). I'm not saying it's the best way to do stuff.
With the patch applied, I tried running the test, and I got a failure when running pg_checksums. The attached log snippet shows the issue, but AFAICS it's happening like this:

1) checksums are disabled
2) flip_data_checksums gets called
3) both clusters go through the 'inprogress-on' and 'on' states
4) primary gets shut down in 'immediate' mode
5) standby gets shut down in 'fast' mode
6) we try to validate checksums on the standby, but the control file still says checksums=inprogress-on

This seems like a bug to me - AFAICS the expectation is that after a fast shutdown, we don't forget the checksum state. Or is that expected? In that case the TAP test probably needs to check the control file, instead of relying on the perl variable $data_checksum_state. Or maybe it should check that the control file has the correct / expected state?

FWIW I don't think the primary shutdown matters. I've seen several of these failures, and it happens even without the primary shutdown. But the standby "fast" shutdown is always there.

But this also shows a limitation of the TAP test - it never triggers the shutdowns while flipping the checksums (in flip_data_checksums). I think that's something worth testing.

regards

-- 
Tomas Vondra
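For illustration, here is a minimal Perl sketch of the control-file check suggested above. It parses the "Data page checksum version" line printed by pg_controldata and allows offline verification only after a clean shutdown with checksums fully enabled. The helper names (controldata_checksum_version, can_verify_offline) are hypothetical and not part of the test module. The sketch also assumes that only version 1 means "fully on"; how the in-progress states show up in the control file with this patch set is an assumption here, not something verified.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the "Data page checksum version" value from
# the text printed by pg_controldata. Returns undef if the field is missing.
sub controldata_checksum_version
{
	my ($controldata_output) = @_;

	return $1
	  if $controldata_output =~ /^Data page checksum version:[ \t]*(\S+)[ \t]*$/m;
	return undef;
}

# Hypothetical helper: decide whether an offline pg_checksums check makes
# sense. Assumption: only version 1 means checksums are fully enabled; any
# other value (0, or whatever an in-progress state is stored as) is skipped.
sub can_verify_offline
{
	my ($controldata_output, $shutdown_clean) = @_;

	my $version = controldata_checksum_version($controldata_output);
	return 0 unless defined $version;
	return ($shutdown_clean && $version eq '1') ? 1 : 0;
}

# Sample input in the format pg_controldata uses for these two fields.
my $sample = "Database cluster state:               shut down in recovery\n"
  . "Data page checksum version:           1\n";

print "verify after fast shutdown: ", can_verify_offline($sample, 1), "\n";
print "verify after immediate shutdown: ", can_verify_offline($sample, 0), "\n";
```

In the TAP test, the input text would presumably come from running pg_controldata against the stopped node's data directory, and the result could then gate checksum_verify_offline() instead of (or in addition to) $data_checksum_state.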
diff --git a/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl b/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl
index b33ca6e0c26..5cee6d4a6b5 100644
--- a/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl
+++ b/src/test/modules/test_checksums/t/006_concurrent_pgbench.pl
@@ -55,7 +55,7 @@ if ($ENV{enable_injection_points} ne 'yes')
 # whether to turn things off during testing.
 sub cointoss
 {
-	return int(rand(2) == 1);
+	return int(rand() < 0.5);
 }

 # Helper for injecting random sleeps here and there in the testrun. The sleep
@@ -74,7 +74,7 @@ sub background_ro_pgbench
 	my ($port, $stdin, $stdout, $stderr) = @_;

 	my $pgbench_primary = IPC::Run::start(
-		[ 'pgbench', '-p', $port, '-S', '-T', '600', '-c', '10', 'postgres' ],
+		[ 'pgbench', '-n', '-p', $port, '-S', '-T', '600', '-c', '10', 'postgres' ],
 		'<' => \$stdin,
 		'>' => \$stdout,
 		'2>' => \$stderr,
@@ -224,6 +224,9 @@ background_rw_pgbench(
 	$node_primary->port, $pgb_primary_stdin,
 	$pgb_primary_stdout, $pgb_primary_stderr);

+my $primary_shutdown_clean = 0;
+my $standby_shutdown_clean = 0;
+
 # Main test suite. This loop will start a pgbench run on the cluster and while
 # that's running flip the state of data checksums concurrently. It will then
 # randomly restart thec cluster (in fast or immediate) mode and then check for
@@ -246,9 +249,11 @@ for (my $i = 0; $i < $TEST_ITERATIONS; $i++)
 		$node_primary_loglocation = -s $node_primary->logfile;

 		# If data checksums are enabled, take the opportunity to verify them
-		# while the cluster is offline
+		# while the cluster is offline (but only if stopped in a clean way,
+		# not after immediate shutdown)
 		$node_primary->checksum_verify_offline()
-		  unless $data_checksum_state eq 'off';
+		  unless $data_checksum_state eq 'off' or !$primary_shutdown_clean;
+
 		random_sleep();
 		$node_primary->start;
 		# Start a pgbench in the background against the primary
@@ -270,9 +275,11 @@ for (my $i = 0; $i < $TEST_ITERATIONS; $i++)
 		$node_standby_1_loglocation = -s $node_standby_1->logfile;

 		# If data checksums are enabled, take the opportunity to verify them
-		# while the cluster is offline
+		# while the cluster is offline (but only if stopped in a clean way,
+		# not after immediate shutdown)
 		$node_standby_1->checksum_verify_offline()
-		  unless $data_checksum_state eq 'off';
+		  unless $data_checksum_state eq 'off' or !$standby_shutdown_clean;
+
 		random_sleep();
 		$node_standby_1->start;
 		# Start a select-only pgbench in the background on the standby
@@ -287,13 +294,41 @@ for (my $i = 0; $i < $TEST_ITERATIONS; $i++)
 	my $result = $node_primary->safe_psql('postgres',
 		"SELECT count(*) FROM t WHERE a > 1");
 	is($result, '100000', 'ensure data pages can be read back on primary');

+	random_sleep();
+
 	$node_primary->wait_for_catchup($node_standby_1, 'write');

-	# Potentially powercycle the cluster
-	$node_primary->stop($stop_modes[ int(rand(100)) ]) if cointoss();
 	random_sleep();
-	$node_standby_1->stop($stop_modes[ int(rand(100)) ]) if cointoss();
+
+	# Potentially powercycle the cluster (the nodes independently)
+	# XXX should maybe try stopping nodes in the opposite order too?
+	if (cointoss())
+	{
+		my $mode = $stop_modes[ int(rand(100)) ];
+		$node_primary->stop($mode);
+		$primary_shutdown_clean = ($mode eq 'fast');
+	}
+
+	random_sleep();
+
+	if (cointoss())
+	{
+		my $mode = $stop_modes[ int(rand(100)) ];
+		$node_standby_1->stop($mode);
+		$standby_shutdown_clean = ($mode eq 'fast');
+	}
+}
+
+# make sure the nodes are running
+if (!$node_primary->is_alive)
+{
+	$node_primary->start;
+}
+
+if (!$node_standby_1->is_alive)
+{
+	$node_standby_1->start;
 }

 # Testrun is over, ensure that data reads back as expected and perform a final
# Postmaster PID for node "standby_1" is 27122
[17:38:45.503](0.673s) ok 104 - ensure checksums are set to off
[17:38:45.513](0.011s) ok 105 - ensure checksums are set to off
[17:38:45.537](0.024s) ok 106 - ensure data checksums are transitioned to inprogress-on
Waiting for replication conn standby_1's replay_lsn to pass 6/8FCAB730 on main
done
[17:38:46.574](1.037s) ok 107 - ensure standby has absorbed the inprogress-on barrier
[17:38:47.585](1.011s) ok 108 - ensure checksums are on, or in progress, on standby_1
[17:39:03.705](16.119s) ok 109 - ensure data checksums are transitioned to on
[17:39:03.716](0.011s) ok 110 - ensure data checksums are transitioned to on
[17:39:03.784](0.068s) ok 111 - ensure data pages can be read back on primary
Waiting for replication conn standby_1's write_lsn to pass 7/2B4BD0F0 on main
done
### Stopping node "main" using mode immediate
# Running: pg_ctl --pgdata /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/t_006_concurrent_pgbench_main_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "main"
### Stopping node "standby_1" using mode fast
# Running: pg_ctl --pgdata /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/t_006_concurrent_pgbench_standby_1_data/pgdata --mode fast stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby_1"
# Running: pg_isready --timeout 180 --host /tmp/800zPudzD2 --port 30082
/tmp/800zPudzD2:30082 - no response
[17:39:06.021](2.237s) ok 112 - no checksum validation errors in primary log
### Starting node "main"
# Running: pg_ctl --wait --pgdata /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/t_006_concurrent_pgbench_main_data/pgdata --log /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/log/006_concurrent_pgbench_main.log --options --cluster-name=main start
waiting for server to start.... done
server started
# Postmaster PID for node "main" is 27488
# Running: pg_isready --timeout 180 --host /tmp/800zPudzD2 --port 30083
/tmp/800zPudzD2:30083 - no response
[17:39:06.132](0.111s) ok 113 - no checksum validation errors in standby_1 log
# Running: pg_checksums -D /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/t_006_concurrent_pgbench_standby_1_data/pgdata -c
pg_checksums: error: data checksums are not enabled in cluster
[17:39:06.134](0.002s) Bail out! command "pg_checksums -D /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/t_006_concurrent_pgbench_standby_1_data/pgdata -c" exited with value 1
# Postmaster PID for node "main" is 27488
### Stopping node "main" using mode immediate
# Running: pg_ctl --pgdata /home/tomas/postgres/src/test/modules/test_checksums/tmp_check/t_006_concurrent_pgbench_main_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "main"
# No postmaster PID for node "standby_1"