Dear Bertrand,

I'm also working on the thread to resolve the random failure.

> Yes, that's also my understanding. It's also easy to "simulate" by adding
> a checkpoint on the primary and a long enough sleep after we launched our sql
> in wait_until_vacuum_can_remove().

Thanks for letting me know. For me, it could be reproduced with only the sleep().

> > So, if the above is correct, the reason for generating extra
> > xl_running_xacts on primary is Vacuum followed by Insert on primary
> > via below part of test:
> > $node_primary->safe_psql(
> > 'testdb', qq[VACUUM $vac_option verbose $to_vac;
> > INSERT INTO flush_wal DEFAULT VALUES;]);
>
> I'm not sure, I think a xl_running_xacts could also be generated (for example
> by the checkpointer) before the vacuum (should the system be slow enough).

I think you are right. When I added `CHECKPOINT` and a sleep after the user SQLs,
I got the ordering below. This means that a RUNNING_XACTS record was generated
before the prune triggered by the vacuum.

```
...
lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS nextXid 766 latestCompletedXid 765 oldestRunningXid 766
...
lsn: 0/04028FD0, prev 0/04026FB0, desc: PRUNE_ON_ACCESS snapshotConflictHorizon: 765,...
...
```

> I'm not sure, as I think a xl_running_xacts could still be generated after
> we execute "our sql" meaning:
>
> "
> $node_primary->safe_psql('testdb', qq[$sql]);
> "
>
> and before we launch the new DML. In that case I guess the issue could still
> happen.
>
> OTOH If we create the new DML "before" we launch "our sql" then the test
> would also fail for both active and inactive slots because that would not
> invalidate any slots.
>
> I did observe the above with the attached changes (just changing the PREPARE
> TRANSACTION location).

I've also tried the idea of keeping a transaction open via background_psql(),
but I got the same result. The test could still fail when the RUNNING_XACTS
record was generated before the transaction started.

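As a side note, the ordering I described above (RUNNING_XACTS logged before the prune record) can be checked mechanically from pg_waldump text output. Below is a minimal sketch in Python. The function names and the regular expression are my own assumptions, derived only from the two lines pasted above and not from any documented format guarantee, and it should be fed only the WAL window of interest because it looks at the first record of each kind.

```python
import re

# Assumed line shape, taken from the pasted output:
#   "... lsn: 0/04025218, prev 0/040251A0, desc: RUNNING_XACTS ..."
RECORD_RE = re.compile(
    r"lsn:\s*([0-9A-Fa-f]+)/([0-9A-Fa-f]+),.*?desc:\s*(\w+)"
)


def parse_lsn(hi, lo):
    """Convert the two hex halves of an LSN ("X/Y") into one integer."""
    return (int(hi, 16) << 32) | int(lo, 16)


def running_xacts_before_prune(waldump_text):
    """Return True if the first RUNNING_XACTS record has a lower LSN
    than the first PRUNE* record in the given text."""
    first_running = None
    first_prune = None
    for line in waldump_text.splitlines():
        m = RECORD_RE.search(line)
        if m is None:
            continue  # not a record line (e.g. "...")
        lsn = parse_lsn(m.group(1), m.group(2))
        desc = m.group(3)
        if desc == "RUNNING_XACTS" and first_running is None:
            first_running = lsn
        elif desc.startswith("PRUNE") and first_prune is None:
            first_prune = lsn
    if first_running is None or first_prune is None:
        return False
    return first_running < first_prune
```

With the two records quoted above as input, the function reports that RUNNING_XACTS came first, which is the problematic case for the test.
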
> I agree, but I'm not sure it's doable as it looks to me that we should prevent
> the catalog xmin to advance past the conflict point while still
> generating a conflict point. Will try to give it another thought.

One primitive idea I had was to stop the walsender/pg_recvlogical process for a
while. Sending SIGSTOP to pg_recvlogical might do the trick, but ISTM it cannot
be used on Windows. See 019_replslot_limit.pl.

Best regards,
Hayato Kuroda
FUJITSU LIMITED