At Mon, 30 May 2022 12:01:55 -0700, Andres Freund <and...@anarazel.de> wrote in 
> Hi,
> 
> On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> > On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > > Andres Freund <and...@anarazel.de> writes:
> > > > I still feel like there's something off here. But that's probably not enough
> > > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > > longer, but not fail the test anymore?
> > > 
> > > WFM.
> > 
> > I've done so now.
> 
> I did look over the test results a couple times since then and once more
> today. There were a few cases with pretty significant numbers of iterations:
> 
> The highest is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
> showing:
> # multiple walsenders active in iteration 19
> 
> It's somewhat interesting that the worst case was just around the feature
> freeze, where the load on my buildfarm animal boxes was higher than normal.

If the disk is too busy, CheckPointReplicationSlots may take a very long time.
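To illustrate why a busy disk matters here: each slot's on-disk state has to be
persisted durably, which means the usual write-temp-file / fsync / rename /
fsync-the-directory sequence. The sketch below is not the actual PostgreSQL
code (save_slot_state and the paths are made-up names for illustration); it
only shows how many synchronous disk operations a single slot can cost, each
of which can stall when the disk is saturated.

/*
 * Rough sketch of the durable-save pattern assumed above.
 * Not the real implementation; names and paths are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
save_slot_state(const char *slot_dir, const void *state, size_t len)
{
	char	tmp_path[1024];
	char	final_path[1024];
	int		fd;

	snprintf(tmp_path, sizeof(tmp_path), "%s/state.tmp", slot_dir);
	snprintf(final_path, sizeof(final_path), "%s/state", slot_dir);

	/* write the new state into a temporary file first */
	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* flush the data; this fsync can already stall for seconds on a busy disk */
	if (write(fd, state, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);

	/* atomically move the new state file into place */
	if (rename(tmp_path, final_path) != 0)
		return -1;

	/* fsync the directory so the rename itself is durable: second stall point */
	fd = open(slot_dir, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int
main(void)
{
	const char buf[] = "dummy slot state";

	/* assumes ./myslot already exists; each call pays at least two fsyncs */
	return save_slot_state("./myslot", buf, sizeof(buf));
}

If each of those fsyncs takes a second or two under heavy I/O load, one
checkpoint cycle easily stretches into the multi-second range seen in the
failing runs.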
> In comparison to earlier approaches, with the current in-tree approach, we
> don't do anything when hitting the "problem", other than wait. Which does give
> us additional information - afaics there's nothing at all indicating that some
> other backend existed allowing the replication slot drop to finish.

Did you mean "preventing"? The checkpointer and a client backend that ran
"SELECT * FROM pg_stat_activity" were the only processes running during the
blocking state.

> It just looks like for reasons I still do not understand, removing a directory
> and 2 files or so takes multiple seconds (at least ~36 new connections, 18
> pg_usleep(100_100)), while there are no other indications of problems.

That fact supports the theory that CheckPointReplicationSlots took a long time.

> I also still don't have a theory why this suddenly started to happen.

Maybe we need to look at the OS-wide disk load at that time. Couldn't the
compiler or other non-postgres tools have put significant load on the disks?

> Unless somebody has another idea, I'm planning to remove all the debugging
> code added, but keep the retry based approach in 019_replslot_limit.pl, so we
> don't again get all the spurious failures.

+1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center