Hi,

On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > > I still feel like there's something off here. But that's probably not
> > > enough to keep causing failures. I'm inclined to leave the debugging in
> > > for a bit longer, but not fail the test anymore?
> >
> > WFM.
>
> I've done so now.
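(In TAP terms, "not failing the test anymore" typically means emitting the
observation as a diagnostic rather than recording a test failure; the
"#"-prefixed line quoted below is that kind of output. A minimal sketch of the
pattern, using hypothetical values rather than the actual 019_replslot_limit.pl
code:)

    use strict;
    use warnings;
    use Test::More;

    # Hypothetical values; in a real test these would come from querying
    # the server under test.
    my $iteration         = 19;
    my $active_walsenders = 2;

    if ($active_walsenders > 1)
    {
        # Print a "# ..." diagnostic line in the TAP output without
        # marking the test as failed.
        diag("multiple walsenders active in iteration $iteration");
    }

    pass('reported as a diagnostic only');
    done_testing();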
I did look over the test results a couple of times since then, and once more
today. There were a few cases with pretty significant numbers of iterations.
The highest is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
showing:

# multiple walsenders active in iteration 19

It's somewhat interesting that the worst case was just around the feature
freeze, when the load on my buildfarm animal boxes was higher than normal.

In comparison to the earlier approaches, the current in-tree approach doesn't
do anything when hitting the "problem" other than wait. That does give us
additional information: afaics there's nothing at all indicating that some
other backend exited, allowing the replication slot drop to finish. It just
looks like, for reasons I still do not understand, removing a directory and 2
files or so takes multiple seconds (at least ~36 new connections, 18
pg_usleep(100_100)), while there are no other indications of problems. I also
still don't have a theory for why this suddenly started to happen.

Unless somebody has another idea, I'm planning to remove all the debugging
code that was added, but keep the retry-based approach in
019_replslot_limit.pl (a minimal sketch of that shape is below), so we don't
again get all the spurious failures.

Greetings,

Andres Freund
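A minimal sketch of that retry-based shape, assuming a hypothetical node name
and using a count of pg_stat_replication rows as a stand-in for whatever
condition the real 019_replslot_limit.pl waits on (this is not the actual test
code):

    use strict;
    use warnings;
    use PostgreSQL::Test::Cluster;
    use PostgreSQL::Test::Utils;
    use Test::More;
    use Time::HiRes qw(usleep);

    my $node = PostgreSQL::Test::Cluster->new('primary');    # hypothetical name
    $node->init(allows_streaming => 1);
    $node->start;

    # Retry instead of failing on the first attempt: poll the condition with
    # short sleeps, bounded by the standard test timeout.
    my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
    my $attempts     = 0;
    while ($attempts++ < $max_attempts)
    {
        my $walsenders = $node->safe_psql('postgres',
            'SELECT count(*) FROM pg_stat_replication');
        last if $walsenders == 0;
        usleep(100_000);    # 100ms between attempts
    }
    ok($attempts <= $max_attempts, 'condition reached before the timeout');

    $node->stop;
    done_testing();

Bounding the loop by timeout_default keeps a genuinely stuck run from hanging
the buildfarm animal indefinitely, while the 100ms sleep keeps the extra load
from the polling negligible.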