Hi,

On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > > I still feel like there's something off here. But that's probably not
> > > enough to keep causing failures. I'm inclined to leave the debugging in
> > > for a bit longer, but not fail the test anymore?
> >
> > WFM.
>
> I've done so now.
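(In TAP terms, "not failing the test anymore" typically means emitting the
observation as a diagnostic rather than recording a test failure; the
"#"-prefixed line quoted below is that kind of output. A minimal sketch of the
pattern, using hypothetical values rather than the actual 019_replslot_limit.pl
code:)

    use strict;
    use warnings;
    use Test::More;

    # Hypothetical values; in a real test these would come from querying
    # the server under test.
    my $iteration         = 19;
    my $active_walsenders = 2;

    if ($active_walsenders > 1)
    {
        # Print a "# ..." diagnostic line in the TAP output without
        # marking the test as failed.
        diag("multiple walsenders active in iteration $iteration");
    }

    pass('reported as a diagnostic only');
    done_testing();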
I did look over the test results a couple of times since then, and once more
today. There were a few cases with pretty significant numbers of iterations.
The highest is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
showing:

# multiple walsenders active in iteration 19

It's somewhat interesting that the worst case was just around the feature
freeze, when the load on my buildfarm animal boxes was higher than normal.

In comparison to the earlier approaches, the current in-tree approach doesn't
do anything when hitting the "problem" other than wait. That does give us
additional information: afaics there's nothing at all indicating that some
other backend exited, allowing the replication slot drop to finish. It just
looks like, for reasons I still do not understand, removing a directory and 2
files or so takes multiple seconds (at least ~36 new connections, 18
pg_usleep(100_100)), while there are no other indications of problems. I also
still don't have a theory for why this suddenly started to happen.

Unless somebody has another idea, I'm planning to remove all the debugging
code that was added, but keep the retry-based approach in
019_replslot_limit.pl (a minimal sketch of that shape is below), so we don't
again get all the spurious failures.

Greetings,

Andres Freund
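A minimal sketch of that retry-based shape, assuming a hypothetical node name
and using a count of pg_stat_replication rows as a stand-in for whatever
condition the real 019_replslot_limit.pl waits on (this is not the actual test
code):

    use strict;
    use warnings;
    use PostgreSQL::Test::Cluster;
    use PostgreSQL::Test::Utils;
    use Test::More;
    use Time::HiRes qw(usleep);

    my $node = PostgreSQL::Test::Cluster->new('primary');    # hypothetical name
    $node->init(allows_streaming => 1);
    $node->start;

    # Retry instead of failing on the first attempt: poll the condition with
    # short sleeps, bounded by the standard test timeout.
    my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
    my $attempts     = 0;
    while ($attempts++ < $max_attempts)
    {
        my $walsenders = $node->safe_psql('postgres',
            'SELECT count(*) FROM pg_stat_replication');
        last if $walsenders == 0;
        usleep(100_000);    # 100ms between attempts
    }
    ok($attempts <= $max_attempts, 'condition reached before the timeout');

    $node->stop;
    done_testing();

Bounding the loop by timeout_default keeps a genuinely stuck run from hanging
the buildfarm animal indefinitely, while the 100ms sleep keeps the extra load
from the polling negligible.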