On 2020-Apr-28, Kyotaro Horiguchi wrote:

> > Anyway I think this patch should fix it also -- instead of adding a new
> > flag, we just rely on the existing flags (since do_checkpoint must have
> > been set correctly from the flags earlier in that block.)
> 
> Since the added (!do_checkpoint) check is reached with
> do_checkpoint=false at server start and at archive_timeout intervals,
> the patch makes the checkpointer run a busy loop at those times, and
> that loop lasts until a checkpoint is actually executed.
> 
> What we need to do here is not forget that the latch has been set,
> even if the latch itself gets reset before we reach WaitLatch.

After a few more false starts :-) I think one easy thing we can do
without the additional boolean flag is to call SetLatch there in the
main loop if we see that ckpt_flags is nonzero.

(I had two issues with the boolean flag.  One is that the comment in
ReqCheckpointHandler needed an update to say, essentially, exactly the
opposite of what it was saying; such a change made me very
uncomfortable.  The other is that the place where the flag was reset in
CheckpointerMain() was not really appropriate; or it could have been
appropriate if the flag had been called, say,
"CheckpointerMainNoSleepOnce".  "RequestPending" was the wrong name,
because if the flag really meant that a request was pending, then we
should reset it inside the "if (do_checkpoint)" block; but as I
understand it, that would cause the busy-loop behavior you described.)

> The attached patch on 019_replslot_limit.pl runs the commands above
> automatically.  It sometimes succeeds but fails in most cases, at
> least for me.

With the additional SetLatch, the test passes reproducibly for me.
Before the patch, it failed ten out of ten times I ran it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 74751b6a3a049ff83c6bef99e2e39562278a7ba6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Mon, 27 Apr 2020 19:35:15 -0400
Subject: [PATCH] Fix checkpoint signalling

Because the checkpointer process now uses its MyLatch to wait for
walsenders to go away (per commit c6550776394e), we must make sure to
nudge it before it goes to sleep.
---
 src/backend/postmaster/checkpointer.c     |  8 +++++
 src/test/recovery/t/019_replslot_limit.pl | 44 +++++++++++++++++++++--
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..94e0161162 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -494,6 +494,14 @@ CheckpointerMain(void)
 		 */
 		pgstat_send_bgwriter();
 
+		/*
+		 * If any checkpoint flags have been set, nudge our latch so that the
+		 * wait below will return immediately, even if a latch signal was
+		 * consumed elsewhere.
+		 */
+		if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
+			SetLatch(MyLatch);
+
 		/*
 		 * Sleep until we are signaled or it's time for another checkpoint or
 		 * xlog file switch.
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 32dce54522..634f2bec8b 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -8,7 +8,7 @@ use TestLib;
 use PostgresNode;
 
 use File::Path qw(rmtree);
-use Test::More tests => 13;
+use Test::More tests => 14;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -179,7 +179,47 @@ for (my $i = 0; $i < 10000; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_standby->stop;
+$node_master->stop('immediate');
+$node_standby->stop('immediate');
+
+my $node_master2 = get_new_node('master2');
+$node_master2->init(allows_streaming => 1);
+$node_master2->append_conf(
+	'postgresql.conf', qq(
+min_wal_size = 32MB
+max_wal_size = 32MB
+log_checkpoints = yes
+));
+$node_master2->start;
+$node_master2->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep1')");
+$backup_name = 'my_backup2';
+$node_master2->backup($backup_name);
+
+$node_master2->stop;
+$node_master2->append_conf(
+	'postgresql.conf', qq(
+max_slot_wal_keep_size = 0
+));
+$node_master2->start;
+
+$node_standby = get_new_node('standby_2');
+$node_standby->init_from_backup($node_master2, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "primary_slot_name = 'rep1'");
+$node_standby->start;
+my @result =
+  split(
+	'\n',
+	$node_master2->safe_psql(
+		'postgres',
+		"CREATE TABLE tt();
+		 DROP TABLE tt;
+		 SELECT pg_switch_wal();
+		 CHECKPOINT;
+		 SELECT 'finished';",
+		timeout => '60'));
+is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
 #####################################
 # Advance WAL of $node by $n segments
-- 
2.20.1
