On 2022-Oct-01, Andres Freund wrote:

> Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code !=
> 0 and _pid is undefined?

Yeah, that sounds reasonable.

> That does seem to reduce the incidence of "leftover" postgres
> instances. 001_start_stop.pl leaves some behind, but that makes sense, because
> it's bypassing the whole node management. But I still occasionally see some
> remaining processes if I crank up test concurrency.
> 
> Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of
> course it can fail as part of the shutdown.

I made teardown_node pass down fail_ok=>1 to avoid this problem, so we
no longer BAIL_OUT in that case.


> But there's still some that survive, where your perl.trace doesn't contain the
> node getting shut down...

Yeah, something's still unexplained.  I'll get this pushed soon, which
already reduces the number of leftover instances a good amount.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From b43b3071abfd7bb83f7ec5942095aa031c3ed038 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Tue, 18 Oct 2022 12:11:46 +0200
Subject: [PATCH v2] Better handle interrupting TAP tests

Set up a signal handler for INT/TERM so that we run our END block if we
get them.  In END, if the exit status indicates a problem, call
_update_pid(-1) to improve chances of the stop working in case start()
hasn't returned yet.

Also, change END's teardown_node() so that it passes fail_ok=>1, so that
if a node fails to stop, we still stop the other nodes in the same test.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index d7c13318b0..17856d69c8 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1545,7 +1545,13 @@ END
 
 	foreach my $node (@all_nodes)
 	{
-		$node->teardown_node;
+		# During unclean termination (which could be a signal or some
+		# other failure), we're not sure that the status of our nodes
+		# has been correctly set up already, so try and detect their
+		# status so that we can shut them down properly.
+		$node->_update_pid(-1) if $exit_code != 0;
+
+		$node->teardown_node(fail_ok => 1);
 
 		# skip clean if we are requested to retain the basedir
 		next if defined $ENV{'PG_TEST_NOCLEAN'};
@@ -1564,13 +1570,15 @@ END
 
 Do an immediate stop of the node
 
+Any optional extra parameter is passed to ->stop.
+
 =cut
 
 sub teardown_node
 {
-	my $self = shift;
+	my ($self, %params) = @_;
 
-	$self->stop('immediate');
+	$self->stop('immediate', %params);
 	return;
 }
 
@@ -2922,6 +2930,13 @@ sub corrupt_page_checksum
 	return;
 }
 
+#
+# Signal handlers
+#
+$SIG{TERM} = $SIG{INT} = sub {
+	die "death by signal";
+};
+
 =pod
 
 =back
-- 
2.30.2

Reply via email to