I've been experimenting with a change to pg_ctl, which I'll post
separately, to reduce its reaction time so that it reports success
more quickly after a wait for postmaster start/stop. I found one
case in "make check-world" that got a failure when I reduced the
reaction time to ~1ms. That's the very last test in 001_stream_rep.pl,
"cascaded slot xmin reset after startup with hs feedback reset", and
the cause appears to be that it's not allowing any delay time for a
replication slot's state to update after a postmaster restart.

This seems worth fixing independently of any possible code changes,
because it shows that this test could fail on a slow or overloaded
machine. I couldn't find any instances of such a failure in the
buildfarm archives, but that may have a lot to do with the fact that
owners of slow buildfarm animals are (mostly?) not running this test.

Some experimentation says that the minimum delay needed to make it
work reliably on my workstation is about 100ms, so a simple patch
along the lines of the attached might be good enough. I find this
approach conceptually dissatisfying, though, since it's still
potentially vulnerable to the failure under sufficient load.

I wonder if there is an easy way to improve that ... maybe convert
to something involving poll_query_until?
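
To make the poll_query_until idea concrete, here is an untested sketch of
what could replace the fixed sleep. It assumes $node_standby_1, $slotname_2
and get_slot_xmins() as they already exist in 001_stream_rep.pl, and that
poll_query_until() re-runs the query until it returns 't' and returns false
if it gives up. The exact query and the error message are my own guesses,
not something taken from an existing patch.

```perl
# Fragment for 001_stream_rep.pl, not a standalone script.
# Assumption: the slot's xmin shows up as NULL in pg_replication_slots
# once the state from the previous run has been cleared.
$node_standby_1->poll_query_until('postgres', qq[
	SELECT xmin IS NULL
	FROM pg_catalog.pg_replication_slots
	WHERE slot_name = '$slotname_2';
]) or die "timed out waiting for slot xmin to be cleared";

# Only now read the values that the existing test goes on to check.
($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
```

On a fast machine this would return as soon as the slot state has caught
up, and on a slow one the wait would be bounded by poll_query_until's own
timeout rather than by a hard-wired sleep.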
regards, tom lane

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 266d27c..8d6edd2 100644
*************** isnt($xmin, '', 'cascaded slot xmin non-
*** 265,270 ****
--- 265,272 ----
# Xmin from a previous run should be cleared on startup.
+ sleep(1); # need some delay before interrogating slot xmin
($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
'cascaded slot xmin reset after startup with hs feedback reset');