Coincidentally, one of my buildfarm animals hanged several weeks in a
different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
reduction was part of making it reproducible:

On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > If you look at the buildfarm's failures page and filter down to
> > > just subscriptionCheck failures, what you find is that all of the
> > > last 6 such failures are in 031_column_list.pl:

> > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > reported a replacement BgWriterDelay value reproducing it.
> 
> Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> BgWriterDelay.

I'm reusing this thread just in case there's overlap with the
031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
a race condition arising from an event sequence like this:

- Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
- Test script calls pg_log_standby_snapshot() on primary.  Emits 
XLOG_RUNNING_XACTS.
- checkpoint_timeout makes a primary checkpoint finish.  Emits 
XLOG_RUNNING_XACTS.
- bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
- CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.

Other test code already has a solution for this, so the attached patches add a
timeout and copy the existing solution.  I'm also attaching the hack that
makes it 100% reproducible.
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    [not for commit] Demonstrate 035_standby_logical_decoding.pl hang.
    
    There are three way XLOG_RUNNING_XACTS can show up:
    - SELECT pg_log_standby_snapshot()
    - CHECKPOINT
    - bgwriter LOG_SNAPSHOT_INTERVAL_MS
    
    An idle primary won't do the second two.  If you're unlucky enough for
    all three to happen before CREATE_REPLICATION_SLOT's
    DecodingContextFindStartpoint() call starts watching, then
    CREATE_REPLICATION_SLOT will hang indefinitely.

diff --git a/src/backend/postmaster/bgwriter.c 
b/src/backend/postmaster/bgwriter.c
index f11ce27..07f987c 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -70,7 +70,7 @@ int                   BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1
 
 /*
  * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c 
b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9270d7b..0f170a4 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -1085,6 +1085,7 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char 
*slotname,
                        appendStringInfoString(&cmd, " PHYSICAL RESERVE_WAL");
        }
 
+       pg_usleep(1000 * 1000);
        res = libpqrcv_PQexec(conn->streamConn, cmd.data);
        pfree(cmd.data);
 
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index cebfa52..c371369 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -467,6 +467,7 @@ $psql_subscriber{run}->pump_nb();
 
 # Speed up the subscription creation
 $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_primary->safe_psql('postgres', "CHECKPOINT");
 
 # Explicitly shut down psql instance gracefully - to avoid hangs
 # or worse on windows
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    Bound waits in 035_standby_logical_decoding.pl.
    
    One IPC::Run::start() used an IPC::Run::timer() without checking for
    expiration.  The other used no timeout or timer.  Back-patch to v16,
    which introduced the test.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/FIXME

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index c371369..61da915 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -21,7 +21,6 @@ my $node_cascading_standby =
   PostgreSQL::Test::Cluster->new('cascading_standby');
 my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
-my $psql_timeout = IPC::Run::timer($default_timeout);
 my $res;
 
 # Name for the physical slot on primary
@@ -90,7 +89,8 @@ sub make_slot_active
                '>',
                $to_stdout,
                '2>',
-               $to_stderr);
+               $to_stderr,
+               IPC::Run::timeout($default_timeout));
 
        if ($wait)
        {
@@ -341,7 +341,7 @@ $psql_subscriber{run} = IPC::Run::start(
        \$psql_subscriber{subscriber_stdout},
        '2>',
        \$psql_subscriber{subscriber_stderr},
-       $psql_timeout);
+       IPC::Run::timeout($default_timeout));
 
 ##################################################
 # Test that logical decoding on the standby
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    Fix test race between primary XLOG_RUNNING_XACTS and standby logical slot.
    
    Before the previous commit, the test could hang until
    LOG_SNAPSHOT_INTERVAL_MS (15s), until checkpoint_timeout (300s), or
    indefinitely.  An indefinite hang was awfully improbable.  It entailed
    the test reaching checkpoint_timeout after a CREATE SUBSCRIPTION and
    before its DecodingContextFindStartpoint().  Back-patch to v16, which
    introduced the test.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/FIXME

diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index 61da915..f7b7fc7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -466,4 +466,11 @@ $psql_subscriber{subscriber_stdin} .= "\n";
 $psql_subscriber{run}->pump_nb();
 
+# See corresponding create_logical_slot_on_standby() code.
+$node_standby->poll_query_until(
+       'postgres', qq[
+               SELECT restart_lsn IS NOT NULL
+               FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
+       ]) or die "timed out waiting for logical slot to calculate its 
restart_lsn";
+
 # Speed up the subscription creation
 $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");

Reply via email to