Fix rare instability in recovery TAP test 009_twophase The phase of the test where we want to check that 2PC transactions prepared on a primary can be committed on a promoted standby relied on an immediate stop of the primary. This logic has a race condition: it could be possible that some records (most likely standby snapshot records) are generated on the primary before it finishes its shutdown, without the promoted standby know about them. When the primary is recycled as new standby, the test could fail because of a timeline fork as an effect of these extra records.
This fix takes care of the instability by doing a clean stop of the primary instead of a teardown (aka immediate stop), so as all records generated on the primary are sent to the promoted standby and flushed there. There is no need for a teardown of the primary in this test scenario: the commit of 2PC transactions on a promoted standby do not care about the state of the primary, only of the standby. This race is very hard to hit in practice, even slow buildfarm members like skink have a very low rate of reproduction. Alexander Lakhin has come up with a recipe to improve the reproduction rate a lot: - Enable -DWAL_DEBUG. - Patch the bgwriter so as standby snapshots are generated every milliseconds. - Run 009_twophase tests under heavy parallelism. With this method, the failure appears after a couple of iterations. With the fix in place, I have been able to run more than 50 iterations of the parallel test sequence, without seeing a failure. Issue introduced in 30820982b295, due to a copy-pasto coming from the surrounding tests. Thanks also to Hayato Kuroda for digging into the details of the failure. He has proposed a fix different than the one of this commit. Unfortunately, it relied on injection points, feature only available in v17. The solution of this commit is simpler, and can be applied to v14~v16. Reported-by: Alexander Lakhin <[email protected]> Discussion: https://postgr.es/m/[email protected] Backpatch-through: 14 Branch ------ REL_14_STABLE Details ------- https://git.postgresql.org/pg/commitdiff/bce98e49e8ace62c1018dc24dec443a50b347cb3 Modified Files -------------- src/test/recovery/t/009_twophase.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
