Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

Tom Lane Sat, 22 Apr 2017 11:41:14 -0700

I wrote:
> Whoa.  This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway?  That's impossibly
> broken/unacceptable.

I poked into this on my laptop, where I'm able to reproduce both assertion
failures. The reason the one in subtrans.c is not being detected by the
test framework is that it happens late in the run and the test doesn't
do anything more with that server than shut it down. "pg_ctl stop"
actually notices there's a problem; for instance, the log on tern for
this step shows

### Stopping node "master" using mode immediate
# Running: pg_ctl -D
/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata
-m immediate stop
pg_ctl: PID file
"/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid"
does not exist
Is server running?
# No postmaster PID

However, PostgresNode::stop blithely ignores the failure exit from
pg_ctl. I think it should use system_or_bail() not just system_log(),
so that we'll notice if the server is not running when we expect it
to be. It's conceivable that there should also be a "stop_if_running"
method that allows the case where the server is not running, but so
far as I can find, no existing TAP test needs such a behavior --- and
it certainly shouldn't be the default.

The walsender.c crash is harder to detect because the postmaster very
nicely recovers and restarts its children. That's great for robustness,
but it sucks for testing. I think that we ought to disable that in
TAP tests, which we can do by having PostgresNode::init include
"restart_after_crash = off" in the postgresql.conf modifications it
applies. Again, it's conceivable that some tests would not want that,
but there is no existing test that seems to need crash recovery on,
and I don't see a good argument for it being default for test purposes.

As best I've been able to tell, the reason why the walsender.c crash
is detected when it's detected is that the TAP script chances to try
to connect while the recovery is happening:

connection error: 'psql: FATAL: the database system is in recovery mode'
while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres'
-f - -v ON_ERROR_STOP=1' at
/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm
line 1173.

The window for that is not terribly wide, explaining why the detection
is unreliable even when the crash does occur.

In short then, I propose the attached patch to make these cases fail
more reliably. We might extend this later to allow the old behaviors
to be explicitly opted-into, but we don't seem to need that today.

regards, tom lane

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index cb84f1f..f8a0581 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -402,6 +402,7 @@ sub init
 	open my $conf, '>>', "$pgdata/postgresql.conf";
 	print $conf "\n# Added by PostgresNode.pm\n";
 	print $conf "fsync = off\n";
+	print $conf "restart_after_crash = off\n";
 	print $conf "log_line_prefix = '%m [%p] %q%a '\n";
 	print $conf "log_statement = all\n";
 	print $conf "port = $port\n";
@@ -675,7 +676,7 @@ sub stop
 	$mode = 'fast' unless defined $mode;
 	return unless defined $self->{_pid};
 	print "### Stopping node \"$name\" using mode $mode\n";
-	TestLib::system_log('pg_ctl', '-D', $pgdata, '-m', $mode, 'stop');
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-m', $mode, 'stop');
 	$self->{_pid} = undef;
 	$self->_update_pid;
 }

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

Reply via email to