Re: BUG: Former primary node might stuck when started as a standby

Alexander Lakhin Fri, 13 Feb 2026 12:00:25 -0800

Dear Kuroda-san,

13.02.2026 04:03, Hayato Kuroda (Fujitsu) wrote:

Dear Alexander,


I checked your test and reproduced the issue with it.
Was it possible that INSERT happened in-between wait_for_replay_catchup and
teardown_node? In this case we may not ensure WAL records generated in the time
window were reached, right?
Similar stuff won't happen in 009_twophase.pl because it does not have the bg
activities.


From my old records, 009_twophase.pl failed exactly due to background
(namely, bgwriter's) activity.

I modified bgwriter.c to reproduce the failure more easily:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,7 +67,7 @@ int            BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1

 /*
  * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
@@ -306,7 +306,7 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
          */
         rc = WaitLatch(MyLatch,
                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

         /*
          * If no latch event and BgBufferSync says nothing's happening, extend
@@ -339,6 +339,5 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
             StrategyNotifyBgWriter(-1);
         }

-        prev_hibernate = can_hibernate;
     }
 }

multiplied the test to increase the probability of the failure:

for i in {1..20}; do cp -r src/test/recovery/ src/test/recovery_$i/; sed "s|src/test/recovery|src/test/recovery_$i|" -i src/test/recovery_$i/Makefile; done


and executed it in a loop:

for i in {1..100}; do echo "ITERATION $i"; parallel --halt now,fail=1 -j20 --linebuffer --tag PROVE_TESTS="t/009*" NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_{} ::: `seq 20` || break; done


It failed for me on iterations 27, 4, 22 as below:
ITERATION 22
...
18      t/009_twophase.pl .. ok
18      All tests successful.
18      Files=1, Tests=30, 12 wallclock secs ( 0.01 usr  0.01 sys + 0.26 cusr  0.58 csys =  0.86 CPU)
18      Result: PASS
5       make: *** wait: No child processes.  Stop.
5       make: *** Waiting for unfinished jobs....
5       make: *** wait: No child processes.  Stop.
parallel: This job failed:
PROVE_TESTS=t/009* NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_5

src/test/recovery_5/tmp_check/log/009_twophase_london.log contains:

2026-02-13 21:03:28.248 EET [3987222] LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/3029190

...

(Without "timeout 60", the test just hangs; we can see the same in [1],
where the test was killed with SIGTERM after 15000 seconds...)
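
To see which of the test copies hit this condition after a failed iteration,
something like the following might help. It is only a sketch: the function
name find_forked_timeline_logs is made up for illustration, and it assumes
the src/test/recovery_N/tmp_check/log layout produced by the cp loop above.

```shell
# Hypothetical helper (not part of the PostgreSQL tree): print the node logs
# that contain the "new timeline ... forked off" message.
# $1 is the top of the source tree; it defaults to the current directory.
find_forked_timeline_logs() {
    # grep -l prints only the names of matching files; errors for a
    # non-matching glob or unreadable files are discarded.
    grep -l "forked off current database system timeline" \
        "${1:-.}"/src/test/recovery_*/tmp_check/log/*.log 2>/dev/null
}
```

Running find_forked_timeline_logs from the top of the source tree should then
list files such as src/test/recovery_5/tmp_check/log/009_twophase_london.log
when the message is present.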

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check

Best regards,
Alexander
