Re: Refactoring postmaster's code to cleanup after child exit

Andres Freund Thu, 06 Mar 2025 12:16:43 -0800

Hi,

On 2025-03-05 20:49:33 -0800, Noah Misch wrote:
> > This behaviour makes it really hard to debug problems. It'd have been a lot
> > easier to understand the problem if we'd seen psql's stderr before the test
> > died.
> > 
> > I guess that mean at the very least we'd need to put an eval {} around the
> > ->pump() call., print $self->{stdout}, ->{stderr} and reraise an error?
> 
> That sounds right.

In the attached patch I did that for wait_connect().  I did verify that it
works by implementing the wait_connect() fix before fixing
002_connection_limits.pl, which fails if a sleep(1) is added just before the
proc_exit(1) for FATAL.

I didn't yet tackle pump_until() yet as it

a) uses pumpable() to check if it's safe to pump() and should kinda sometimes
   maybe report an error, even though the fact that it doesn't display stderr
   (if stout is waited on) makes it harder to debug.

b) Fixing the error report seems like it'd require an interface change to
   pump_until().

> Officially, you could call ->pumpable() before ->pump().  It's defined as
> 'Returns TRUE if calling pump() won't throw an immediate "process ended
> prematurely" exception.'

It's also documented to be internal only...

I do share your doubts re pumpable():

> I lack high confidence that it avoids the exception,
> because the pump() still calls pumpable()->reap_nb()->waitpid(WNOHANG) and may
> decide "process ended prematurely" based on the new finding.  In other words,
> I bet there would be a TOCTOU defect in "$h->pump if $h->pumpable".

On 2025-03-05 08:23:32 +0900, Michael Paquier wrote:
> > For this test, could we perhaps rely on the log messages postmaster logs 
> > when
> > child processes exit?
> > 
> > 2025-03-04 17:56:12.528 EST [3509838][not initialized][:0][[unknown]] LOG:  
> > connection received: host=[local]
> > 2025-03-04 17:56:12.528 EST [3509838][client backend][:0][[unknown]] FATAL: 
> >  sorry, too many clients already
> > 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  releasing 
> > pm child slot 2
> > 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  client 
> > backend (PID 3509838) exited with exit code 1
> > 
> > I.e. the test could wait for the 'client backend exited' message using
> > ->wait_for_log()?
> 
> Matching expected contents in the server logs is a practice I've found
> to be rather reliable, with wait_for_log().

The attached patch implements that approach. It does fix the problem from what
I can tell. It's not great that it requires log_min_messages = DEBUG2, but
that seems ok for this test.

> Why not adding an injection point with a WARNING or a LOG generated, then
> check the server logs for the code path taken based on the elog() generated
> with the point name?

I think the log_min_messages approach is a lot simpler. If we need something
like this more widely we can reconsider injection points...

I also attached a patch to improve connect_fails()/connect_ok() test names a
bit. They weren't symmetric and I felt they were lacking in detail for the
psql return code check.

Another annoying and also funny problem I saw is this failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-06%2009%3A18%3A21
2025-03-06 10:42:02.552 UTC [372451][postmaster][:0] LOG:  1800 s is outside 
the valid range for parameter "authentication_timeout" (1 s .. 600 s)

I had to increase PG_TEST_TIMEOUT_DEFAULT due to some other test timing out
when run under valgrind (due to having to insert a lot of rows). But then this
test runs into the above issue.

The easiest way seems to be to just limit PG_TEST_TIMEOUT_DEFAULT in this
test.

Greetings,

Andres Freund

>From 9655d7a40e0d410b15457d69392de847ddf141ba Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 6 Mar 2025 10:36:24 -0500
Subject: [PATCH v1 1/4] tests: Improve test names in
 connect_fails()/connect_ok()

connect_fails() didn't mention that stderr matched, whereas connect_ok() did.

Neither connect_fails() nor connect_ok() mentioned what they were checking
when checking psql's return status.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b105cba05a6..883532e1cd3 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2554,7 +2554,7 @@ sub connect_ok
 		connstr => "$connstr",
 		on_error_stop => 0);
 
-	is($ret, 0, $test_name);
+	is($ret, 0, "$test_name: connect succeeds, as expected");
 
 	if (defined($params{expected_stdout}))
 	{
@@ -2619,11 +2619,11 @@ sub connect_fails
 		extra_params => ['-w'],
 		connstr => "$connstr");
 
-	isnt($ret, 0, $test_name);
+	isnt($ret, 0, "$test_name: connect fails, as expected");
 
 	if (defined($params{expected_stderr}))
 	{
-		like($stderr, $params{expected_stderr}, "$test_name: matches");
+		like($stderr, $params{expected_stderr}, "$test_name: stderr matches");
 	}
 
 	$self->log_check($test_name, $log_location, %params);
-- 
2.48.1.76.g4e746b1a31.dirty

>From e080bcccd441c7e3a993ef72b7cd842ae939ecce Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 6 Mar 2025 10:38:44 -0500
Subject: [PATCH v1 2/4] tests: Add note if BackgroundPsql::wait_connect()
 fails

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 .../perl/PostgreSQL/Test/BackgroundPsql.pm    | 26 ++++++++++++++++---
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index c611a61cf4e..1deb410c133 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -154,10 +154,28 @@ sub wait_connect
 	my $banner = "background_psql: ready";
 	my $banner_match = qr/(^|\n)$banner\r?\n/;
 	$self->{stdin} .= "\\echo $banner\n\\warn $banner\n";
-	$self->{run}->pump()
-	  until ($self->{stdout} =~ /$banner_match/
-		  && $self->{stderr} =~ /$banner\r?\n/)
-	  || $self->{timeout}->is_expired;
+
+	# IPC::Run throws in case psql exits while we're pumping. To make it
+	# easier to diagnose that, catch the error, report stdout/stderr at time
+	# of death and reraise.
+	eval {
+		$self->{run}->pump()
+		  until ($self->{stdout} =~ /$banner_match/
+			  && $self->{stderr} =~ /$banner\r?\n/)
+		  || $self->{timeout}->is_expired;
+	};
+	if ($@)
+	{
+		chomp(my $stdout = $self->{stdout});
+		chomp(my $stderr = $self->{stderr});
+		chomp(my $err = $@);
+		diag qq(psql died while connecting:
+  stdout: $stdout
+  stderr: $stderr
+  perl error: $err
+);
+		die "psql died while connecting";
+	}
 
 	note "connect output:\n",
 	  explain {
-- 
2.48.1.76.g4e746b1a31.dirty

>From 6ba326b1bb39a8d181137482df425a109539aa15 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 6 Mar 2025 10:30:21 -0500
Subject: [PATCH v1 3/4] tests: Try to fix race condition in
 postmaster/002_connection_limits

We need to wait for process exit.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 .../postmaster/t/002_connection_limits.pl     | 35 +++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/postmaster/t/002_connection_limits.pl b/src/test/postmaster/t/002_connection_limits.pl
index 8cfa6e0ced5..2c185eef6eb 100644
--- a/src/test/postmaster/t/002_connection_limits.pl
+++ b/src/test/postmaster/t/002_connection_limits.pl
@@ -20,6 +20,7 @@ $node->append_conf('postgresql.conf', "max_connections = 6");
 $node->append_conf('postgresql.conf', "reserved_connections = 2");
 $node->append_conf('postgresql.conf', "superuser_reserved_connections = 1");
 $node->append_conf('postgresql.conf', "log_connections = on");
+$node->append_conf('postgresql.conf', "log_min_messages=debug2");
 $node->start;
 
 $node->safe_psql(
@@ -45,13 +46,39 @@ sub background_psql_as_user
 		extra_params => [ '-U', $user ]);
 }
 
+# Like connect_fails(), except that we also wait for the failed backend to
+# have exited.
+#
+# This tests needs to wait for client processes to exit because the error
+# message for a failed connection is reported before the backend has detached
+# from shared memory. If we didn't wait, subsequent tests might hit connection
+# limits spuriously.
+#
+# This can't easily be generalized, as detecting process exit requires
+# log_min_messages to be at least DEBUG2 and is not concurrency safe, as we
+# can't easily be sure the right process exited. In this test that's not a
+# problem though, we only have one new connection at a time.
+sub connect_fails_wait
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($node, $connstr, $test_name, %params) = @_;
+
+	my $log_location = -s $node->logfile;
+
+	$node->connect_fails($connstr, $test_name, %params);
+	$node->wait_for_log(qr/DEBUG:  client backend.*exited with exit code 1/,
+		$log_location);
+	ok(1, "$test_name: client backend process exited");
+}
+
 my @sessions = ();
 my @raw_connections = ();
 
 push(@sessions, background_psql_as_user('regress_regular'));
 push(@sessions, background_psql_as_user('regress_regular'));
 push(@sessions, background_psql_as_user('regress_regular'));
-$node->connect_fails(
+connect_fails_wait(
+	$node,
 	"dbname=postgres user=regress_regular",
 	"reserved_connections limit",
 	expected_stderr =>
@@ -60,7 +87,8 @@ $node->connect_fails(
 
 push(@sessions, background_psql_as_user('regress_reserved'));
 push(@sessions, background_psql_as_user('regress_reserved'));
-$node->connect_fails(
+connect_fails_wait(
+	$node,
 	"dbname=postgres user=regress_regular",
 	"reserved_connections limit",
 	expected_stderr =>
@@ -68,7 +96,8 @@ $node->connect_fails(
 );
 
 push(@sessions, background_psql_as_user('regress_superuser'));
-$node->connect_fails(
+connect_fails_wait(
+	$node,
 	"dbname=postgres user=regress_superuser",
 	"superuser_reserved_connections limit",
 	expected_stderr => qr/FATAL:  sorry, too many clients already/);
-- 
2.48.1.76.g4e746b1a31.dirty

>From b05b8d2031e430f425cb8c23b829002d7277c520 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 6 Mar 2025 15:13:10 -0500
Subject: [PATCH v1 4/4] tests: Don't fail due to high default timeout in
 postmaster/003_start_stop

Per buildfarm animal skink.

Discussion: https://postgr.es/m/20250306044933.7a.nmi...@google.com
---
 src/test/postmaster/t/003_start_stop.pl | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/test/postmaster/t/003_start_stop.pl b/src/test/postmaster/t/003_start_stop.pl
index 036b296f72b..4dc394139d9 100644
--- a/src/test/postmaster/t/003_start_stop.pl
+++ b/src/test/postmaster/t/003_start_stop.pl
@@ -20,6 +20,10 @@ use Test::More;
 # "pg_ctl stop" will error out before the authentication timeout kicks
 # in and cleans up the dead-end backends.
 my $authentication_timeout = $PostgreSQL::Test::Utils::timeout_default;
+
+# Don't fail due to hitting the max value allowed for authentication_timeout.
+$authentication_timeout = 600 unless $authentication_timeout < 600;
+
 my $stop_timeout = $authentication_timeout / 2;
 
 # Initialize the server with low connection limits, to test dead-end backends
-- 
2.48.1.76.g4e746b1a31.dirty

Re: Refactoring postmaster's code to cleanup after child exit

Reply via email to