Coincidentally, one of my buildfarm animals hanged several weeks in a different test, 035_standby_logical_decoding.pl. A LOG_SNAPSHOT_INTERVAL_MS reduction was part of making it reproducible:
On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote: > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote: > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote: > > > If you look at the buildfarm's failures page and filter down to > > > just subscriptionCheck failures, what you find is that all of the > > > last 6 such failures are in 031_column_list.pl: > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com > > reported a replacement BgWriterDelay value reproducing it. > > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to > BgWriterDelay. I'm reusing this thread just in case there's overlap with the 031_column_list.pl cause and fix. The 035_standby_logical_decoding.pl hang is a race condition arising from an event sequence like this: - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU. - Test script calls pg_log_standby_snapshot() on primary. Emits XLOG_RUNNING_XACTS. - checkpoint_timeout makes a primary checkpoint finish. Emits XLOG_RUNNING_XACTS. - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic. Emits XLOG_RUNNING_XACTS. - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby. Other test code already has a solution for this, so the attached patches add a timeout and copy the existing solution. I'm also attaching the hack that makes it 100% reproducible.
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> [not for commit] Demonstrate 035_standby_logical_decoding.pl hang. There are three way XLOG_RUNNING_XACTS can show up: - SELECT pg_log_standby_snapshot() - CHECKPOINT - bgwriter LOG_SNAPSHOT_INTERVAL_MS An idle primary won't do the second two. If you're unlucky enough for all three to happen before CREATE_REPLICATION_SLOT's DecodingContextFindStartpoint() call starts watching, then CREATE_REPLICATION_SLOT will hang indefinitely. diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index f11ce27..07f987c 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -70,7 +70,7 @@ int BgWriterDelay = 200; * Interval in which standby snapshots are logged into the WAL stream, in * milliseconds. */ -#define LOG_SNAPSHOT_INTERVAL_MS 15000 +#define LOG_SNAPSHOT_INTERVAL_MS 1 /* * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index 9270d7b..0f170a4 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -1085,6 +1085,7 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname, appendStringInfoString(&cmd, " PHYSICAL RESERVE_WAL"); } + pg_usleep(1000 * 1000); res = libpqrcv_PQexec(conn->streamConn, cmd.data); pfree(cmd.data); diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index cebfa52..c371369 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -467,6 +467,7 @@ $psql_subscriber{run}->pump_nb(); # Speed up the subscription creation $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()"); +$node_primary->safe_psql('postgres', "CHECKPOINT"); # Explicitly shut down psql instance gracefully - to avoid hangs # or worse on windows
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> Bound waits in 035_standby_logical_decoding.pl. One IPC::Run::start() used an IPC::Run::timer() without checking for expiration. The other used no timeout or timer. Back-patch to v16, which introduced the test. Reviewed by FIXME. Discussion: https://postgr.es/m/FIXME diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index c371369..61da915 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -21,7 +21,6 @@ my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby'); my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); my $default_timeout = $PostgreSQL::Test::Utils::timeout_default; -my $psql_timeout = IPC::Run::timer($default_timeout); my $res; # Name for the physical slot on primary @@ -90,7 +89,8 @@ sub make_slot_active '>', $to_stdout, '2>', - $to_stderr); + $to_stderr, + IPC::Run::timeout($default_timeout)); if ($wait) { @@ -341,7 +341,7 @@ $psql_subscriber{run} = IPC::Run::start( \$psql_subscriber{subscriber_stdout}, '2>', \$psql_subscriber{subscriber_stderr}, - $psql_timeout); + IPC::Run::timeout($default_timeout)); ################################################## # Test that logical decoding on the standby
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> Fix test race between primary XLOG_RUNNING_XACTS and standby logical slot. Before the previous commit, the test could hang until LOG_SNAPSHOT_INTERVAL_MS (15s), until checkpoint_timeout (300s), or indefinitely. An indefinite hang was awfully improbable. It entailed the test reaching checkpoint_timeout after a CREATE SUBSCRIPTION and before its DecodingContextFindStartpoint(). Back-patch to v16, which introduced the test. Reviewed by FIXME. Discussion: https://postgr.es/m/FIXME diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index 61da915..f7b7fc7 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -466,4 +466,11 @@ $psql_subscriber{subscriber_stdin} .= "\n"; $psql_subscriber{run}->pump_nb(); +# See corresponding create_logical_slot_on_standby() code. +$node_standby->poll_query_until( + 'postgres', qq[ + SELECT restart_lsn IS NOT NULL + FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' + ]) or die "timed out waiting for logical slot to calculate its restart_lsn"; + # Speed up the subscription creation $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");