Hello Michael and Bertrand,

15.01.2024 06:59, Michael Paquier wrote:
The WAL records related to standby snapshots contribute a lot to
the randomness of the failures we are seeing.  Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable.  That would not be workable for
Windows, but I could live with that knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here, and that would leave skipping the test for
$windows_os as the only limitation.

I've found a way to implement pause/resume for Windows processes, and it
looks acceptable to me if we can afford "use Win32::API;" on Windows
(maybe the test could be skipped only if this Perl module is absent).
Please look at the PoC patch for the test 035_standby_logical_decoding.
(The patched test passes for me.)
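
To illustrate the "skip only if the module is absent" idea, here is a
minimal sketch. It is not part of the PoC patch, and the helper names
(pause_method, suspend_with_signal, resume_with_signal) are hypothetical.
It assumes that on non-Windows platforms plain SIGSTOP/SIGCONT is enough,
and that Win32::API should be loaded only when running on Windows.

```perl
use strict;
use warnings;

# Hypothetical helper: decide how a process can be paused on this platform.
# Returns 'win32' when Win32::API can be loaded on Windows, 'signal' on
# other platforms, and undef when no method is available.
sub pause_method
{
	my ($os) = @_;
	$os //= $^O;
	if ($os eq 'MSWin32')
	{
		# Load the module lazily, so other platforms never need it.
		return eval { require Win32::API; 1 } ? 'win32' : undef;
	}
	return 'signal';
}

# Assumption: on non-Windows platforms SIGSTOP/SIGCONT is sufficient.
sub suspend_with_signal
{
	my ($pid) = @_;
	return kill('STOP', $pid);
}

sub resume_with_signal
{
	my ($pid) = @_;
	return kill('CONT', $pid);
}
```

In the test itself, an undefined result from pause_method() could be turned
into "plan skip_all => ..." (Test::More), so that only Windows machines
without Win32::API skip the test.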

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally, I think that having such functionality available in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to prevent such WAL records from being
generated within certain areas of the test.  This requires a small
implementation, but nothing really huge, while being portable
everywhere.  And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.

I see that the test in question exists in REL_16_STABLE; does that mean
a new GUC would not help there?

Best regards,
Alexander
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..8b08c7b5c7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,8 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+use Win32::API;
+
 my ($stdin, $stdout, $stderr,
 	$cascading_stdout, $cascading_stderr, $subscriber_stdin,
 	$subscriber_stdout, $subscriber_stderr, $ret,
@@ -28,6 +30,28 @@ my $res;
 my $primary_slotname = 'primary_physical';
 my $standby_physical_slotname = 'standby_physical';
 
+my $OpenProcess = new Win32::API("kernel32", "HANDLE OpenProcess(DWORD dwDesiredAccess, BOOL bInheritHandle, DWORD dwProcessId )");
+my $NtSuspendProcess = new Win32::API("ntdll", 'LONG NtSuspendProcess(HANDLE hProcess)');
+my $NtResumeProcess = new Win32::API("ntdll", 'LONG NtResumeProcess(HANDLE hProcess)');
+my $CloseHandle = new Win32::API("kernel32", "BOOL CloseHandle(HANDLE hObject)");
+my $PROCESS_ALL_ACCESS = 0x001F0FFF;
+
+sub suspend_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtSuspendProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
+sub resume_process
+{
+	my $pid = shift;
+	my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+	$NtResumeProcess->Call($hProcess);
+	$CloseHandle->Call($hProcess);
+}
+
 # Fetch xmin columns from slot's pg_replication_slots row, after waiting for
 # given boolean condition to be true to ensure we've reached a quiescent state.
 sub wait_for_xmins
@@ -456,6 +480,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
 $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
 $node_subscriber->stop;
 
+my $bgwriterpid = $node_primary->safe_psql('postgres', "SELECT pid FROM pg_stat_activity WHERE backend_type = 'background writer'");
+suspend_process($bgwriterpid);
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -690,6 +717,7 @@ $node_primary->safe_psql('testdb', qq[INSERT INTO prun VALUES (1, 'A');]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'B';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'C';]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(25);]);
 $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
 
 $node_primary->wait_for_replay_catchup($node_standby);
@@ -709,6 +737,8 @@ check_pg_recvlogical_stderr($handle,
 # Turn hot_standby_feedback back on
 change_hot_standby_feedback_and_wait_for_xmins(1, 1);
 
+resume_process($bgwriterpid);
+
 ##################################################
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 5: incorrect wal_level on primary.
