On 2022-Oct-01, Andres Freund wrote: > Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code != > 0 and _pid is undefined?
Yeah, that sounds reasonable. > That does seem to reduce the incidence of "leftover" postgres > instances. 001_start_stop.pl leaves some behind, but that makes sense, because > it's bypassing the whole node management. But I still occasionally see some > remaining processes if I crank up test concurrency. > > Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of > course it can fail as part of the shutdown. I made teardown_node pass down fail_ok=>1 to avoid this problem, so we no longer BAIL_OUT in that case. > But there's still some that survive, where your perl.trace doesn't contain the > node getting shut down... Yeah, something's still unexplained. I'll get this pushed soon, which already reduces the number of leftover instances a good amount. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
>From b43b3071abfd7bb83f7ec5942095aa031c3ed038 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera <alvhe...@alvh.no-ip.org> Date: Tue, 18 Oct 2022 12:11:46 +0200 Subject: [PATCH v2] Better handle interrupting TAP tests Set up a signal handler for INT/TERM so that we run our END block if we get them. In END, if the exit status indicates a problem, call _update_pid(-1) to improve chances of the stop working in case start() hasn't returned yet. Also, change END's teardown_node() so that it passes fail_ok=>1, so that if a node fails to stop, we still stop the other nodes in the same test. --- src/test/perl/PostgreSQL/Test/Cluster.pm | 21 ++++++++++++++++++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index d7c13318b0..17856d69c8 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -1545,7 +1545,13 @@ END foreach my $node (@all_nodes) { - $node->teardown_node; + # During unclean termination (which could be a signal or some + # other failure), we're not sure that the status of our nodes + # has been correctly set up already, so try and detect their + # status so that we can shut them down properly. + $node->_update_pid(-1) if $exit_code != 0; + + $node->teardown_node(fail_ok => 1); # skip clean if we are requested to retain the basedir next if defined $ENV{'PG_TEST_NOCLEAN'}; @@ -1564,13 +1570,15 @@ END Do an immediate stop of the node +Any optional extra parameter is passed to ->stop. + =cut sub teardown_node { - my $self = shift; + my ($self, %params) = @_; - $self->stop('immediate'); + $self->stop('immediate', %params); return; } @@ -2922,6 +2930,13 @@ sub corrupt_page_checksum return; } +# +# Signal handlers +# +$SIG{TERM} = $SIG{INT} = sub { + die "death by signal"; +}; + =pod =back -- 2.30.2