> On Apr 22, 2017, at 11:40 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> 
> I wrote:
>> Whoa.  This just turned into a much larger can of worms than I expected.
>> How can it be that processes are getting assertion crashes and yet the
>> test framework reports success anyway?  That's impossibly
>> broken/unacceptable.
> 
> I poked into this on my laptop, where I'm able to reproduce both assertion
> failures.  The reason the one in subtrans.c is not being detected by the
> test framework is that it happens late in the run and the test doesn't
> do anything more with that server than shut it down.  "pg_ctl stop"
> actually notices there's a problem; for instance, the log on tern for
> this step shows
> 
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D 
> /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata
>  -m immediate stop
> pg_ctl: PID file 
> "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid"
>  does not exist
> Is server running?
> # No postmaster PID
> 
> However, PostgresNode::stop blithely ignores the failure exit from
> pg_ctl.  I think it should use system_or_bail() not just system_log(),
> so that we'll notice if the server is not running when we expect it
> to be.  It's conceivable that there should also be a "stop_if_running"
> method that allows the case where the server is not running, but so
> far as I can find, no existing TAP test needs such a behavior --- and
> it certainly shouldn't be the default.
> 
> The walsender.c crash is harder to detect because the postmaster very
> nicely recovers and restarts its children.  That's great for robustness,
> but it sucks for testing.  I think that we ought to disable that in
> TAP tests, which we can do by having PostgresNode::init include
> "restart_after_crash = off" in the postgresql.conf modifications it
> applies.  Again, it's conceivable that some tests would not want that,
> but there is no existing test that seems to need crash recovery on,
> and I don't see a good argument for it being default for test purposes.
> 
> As best I've been able to tell, the reason why the walsender.c crash
> is detected when it's detected is that the TAP script chances to try
> to connect while the recovery is happening:
> 
> connection error: 'psql: FATAL:  the database system is in recovery mode'
> while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq 
> dbname='postgres' -f - -v ON_ERROR_STOP=1' at 
> /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm
>  line 1173.
> 
> The window for that is not terribly wide, explaining why the detection
> is unreliable even when the crash does occur.
> 
> In short then, I propose the attached patch to make these cases fail
> more reliably.  We might extend this later to allow the old behaviors
> to be explicitly opted-into, but we don't seem to need that today.
> 
>                       regards, tom lane

I pulled fresh sources with your latest commit, 
7d68f2281a4b56834c8e5648fc7da0b73b674c45,
and the tests consistently fail (5 out of 5 attempts) for me on my laptop with:

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C recovery check
for extra in contrib/test_decoding; do 
/Applications/Xcode.app/Contents/Developer/usr/bin/make -C '../../..'/$extra 
DESTDIR='/Users/mark/vanilla/bitoctetlength'/tmp_install install 
>>'/Users/mark/vanilla/bitoctetlength'/tmp_install/log/install.log || exit; done
rm -rf /Users/mark/vanilla/bitoctetlength/src/test/recovery/tmp_check/log
cd . && TESTDIR='/Users/mark/vanilla/bitoctetlength/src/test/recovery' 
PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/bin:$PATH" 
DYLD_LIBRARY_PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/lib"
 PGPORT='65432' 
PG_REGRESS='/Users/mark/vanilla/bitoctetlength/src/test/recovery/../../../src/test/regress/pg_regress'
 prove -I ../../../src/test/perl/ -I .  t/*.pl
t/001_stream_rep.pl .................. ok     
t/002_archiving.pl ................... ok   
t/003_recovery_targets.pl ............ ok   
t/004_timeline_switch.pl ............. Bailout called.  Further testing 
stopped:  system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [check] Error 255
make[1]: *** [check-recovery-recurse] Error 2
make: *** [check-world-src/test-recurse] Error 2

The first time, I had some local changes, but I've stashed them and am still 
getting the error each of
these next four times.  I'm compiling as follows:

make distclean; make clean; ./configure --enable-cassert --enable-tap-tests 
--enable-depend && make -j4 && make check-world

Are the errors expected now?

mark

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to