Hello Michael and Bertrand,
15.01.2024 06:59, Michael Paquier wrote:
The WAL records related to standby snapshots are playing a lot with
the randomness of the failures we are seeing. Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable. That would not be workable for
Windows, but I could live with that knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here, and that would impose the limitation of skipping the test for
$windows_os.
I've found a way to implement pause/resume for Windows processes and it
looks acceptable to me if we can afford "use Win32::API;" on Windows
(maybe the test could be skipped only if this perl module is absent).
Please look at the PoC patch for the test 035_standby_logical_decoding.
(The patched test passes for me.)
If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.
Personally, I think that having such functionality available in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.
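As a rough sketch of what such a submodule might contain (this is not the PoC
patch itself): the names can_pause_processes, suspend_process, resume_process
and _win32_suspend_or_resume are hypothetical, and the sketch assumes that
SIGSTOP/SIGCONT are acceptable on non-Windows platforms. Win32::API is loaded
at run time, so the file still compiles where the module is absent.

```perl
use strict;
use warnings;

# True on native Windows Perl. The TAP framework has its own $windows_os
# flag; it is not used here so that the sketch stays standalone.
my $on_windows = ($^O eq 'MSWin32');

# Load Win32::API lazily: a compile-time "use" would break other platforms.
my $have_win32_api = ($on_windows && eval { require Win32::API; 1 }) ? 1 : 0;

# Same access mask as in the PoC patch.
my $PROCESS_ALL_ACCESS = 0x001F0FFF;

# Can this platform pause and resume another process at all?
sub can_pause_processes
{
    return (!$on_windows || $have_win32_api) ? 1 : 0;
}

# Windows path: open the process, call the given ntdll function, close.
sub _win32_suspend_or_resume
{
    my ($func_name, $pid) = @_;
    die "Win32::API is not available\n" unless $have_win32_api;

    my $open = Win32::API->new('kernel32',
        'HANDLE OpenProcess(DWORD dwDesiredAccess, BOOL bInheritHandle, DWORD dwProcessId)')
      or die "could not import OpenProcess\n";
    my $func = Win32::API->new('ntdll', "LONG $func_name(HANDLE hProcess)")
      or die "could not import $func_name\n";
    my $close = Win32::API->new('kernel32', 'BOOL CloseHandle(HANDLE hObject)')
      or die "could not import CloseHandle\n";

    my $handle = $open->Call($PROCESS_ALL_ACCESS, 0, $pid)
      or die "OpenProcess failed for pid $pid\n";
    my $status = $func->Call($handle);
    $close->Call($handle);
    die "$func_name failed for pid $pid (status $status)\n" if $status != 0;
    return;
}

# Pause the process with the given PID.
sub suspend_process
{
    my ($pid) = @_;
    return _win32_suspend_or_resume('NtSuspendProcess', $pid) if $on_windows;
    kill('STOP', $pid) == 1 or die "could not send SIGSTOP to $pid: $!\n";
    return;
}

# Resume a process previously paused with suspend_process().
sub resume_process
{
    my ($pid) = @_;
    return _win32_suspend_or_resume('NtResumeProcess', $pid) if $on_windows;
    kill('CONT', $pid) == 1 or die "could not send SIGCONT to $pid: $!\n";
    return;
}
```

With something like this, a test could start with
"plan skip_all => 'cannot pause processes' unless can_pause_processes();"
rather than being skipped on every Windows machine.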
While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to disable such WAL records to be
generated within certain areas of the test. This requires a small
implementation, but nothing really huge, while being portable
everywhere. And it is not the first time I've been annoyed with these
records when wanting a predictable set of WAL records for some test
case.
I see that the test in question exists in REL_16_STABLE; does that mean a
new GUC would not help there?
Best regards,
Alexander
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 8bc39a5f03..8b08c7b5c7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -10,6 +10,8 @@ use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
+use Win32::API;
+
my ($stdin, $stdout, $stderr,
$cascading_stdout, $cascading_stderr, $subscriber_stdin,
$subscriber_stdout, $subscriber_stderr, $ret,
@@ -28,6 +30,28 @@ my $res;
my $primary_slotname = 'primary_physical';
my $standby_physical_slotname = 'standby_physical';
+my $OpenProcess = new Win32::API("kernel32", "HANDLE OpenProcess(DWORD dwDesiredAccess, BOOL bInheritHandle, DWORD dwProcessId )");
+my $NtSuspendProcess = new Win32::API("ntdll", 'LONG NtSuspendProcess(HANDLE hProcess)');
+my $NtResumeProcess = new Win32::API("ntdll", 'LONG NtResumeProcess(HANDLE hProcess)');
+my $CloseHandle = new Win32::API("kernel32", "BOOL CloseHandle(HANDLE hObject)");
+my $PROCESS_ALL_ACCESS = 0x001F0FFF;
+
+sub suspend_process
+{
+ my $pid = shift;
+ my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+ $NtSuspendProcess->Call($hProcess);
+ $CloseHandle->Call($hProcess);
+}
+
+sub resume_process
+{
+ my $pid = shift;
+ my $hProcess = $OpenProcess->Call($PROCESS_ALL_ACCESS, 0, $pid);
+ $NtResumeProcess->Call($hProcess);
+ $CloseHandle->Call($hProcess);
+}
+
# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
# given boolean condition to be true to ensure we've reached a quiescent state.
sub wait_for_xmins
@@ -456,6 +480,9 @@ is($result, qq(10), 'check replicated inserts after subscription on standby');
$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
$node_subscriber->stop;
+my $bgwriterpid = $node_primary->safe_psql('postgres', "SELECT pid FROM pg_stat_activity WHERE backend_type = 'background writer'");
+suspend_process($bgwriterpid);
+
##################################################
# Recovery conflict: Invalidate conflicting slots, including in-use slots
# Scenario 1: hot_standby_feedback off and vacuum FULL
@@ -690,6 +717,7 @@ $node_primary->safe_psql('testdb', qq[INSERT INTO prun VALUES (1, 'A');]);
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'B';]);
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'C';]);
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(25);]);
$node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
$node_primary->wait_for_replay_catchup($node_standby);
@@ -709,6 +737,8 @@ check_pg_recvlogical_stderr($handle,
# Turn hot_standby_feedback back on
change_hot_standby_feedback_and_wait_for_xmins(1, 1);
+resume_process($bgwriterpid);
+
##################################################
# Recovery conflict: Invalidate conflicting slots, including in-use slots
# Scenario 5: incorrect wal_level on primary.