At Mon, 30 May 2022 12:01:55 -0700, Andres Freund <and...@anarazel.de> wrote in 
> Hi,
> 
> On 2022-03-27 22:37:34 -0700, Andres Freund wrote:
> > On 2022-03-27 17:36:14 -0400, Tom Lane wrote:
> > > Andres Freund <and...@anarazel.de> writes:
> > > > I still feel like there's something off here. But that's probably not enough
> > > > to keep causing failures. I'm inclined to leave the debugging in for a bit
> > > > longer, but not fail the test anymore?
> > > 
> > > WFM.
> > 
> > I've done so now.
> 
> I did look over the test results a couple times since then and once more
> today. There were a few cases with pretty significant numbers of iterations:
> 
> The highest is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2022-04-07%2022%3A14%3A03
> showing:
> # multiple walsenders active in iteration 19
> 
> It's somewhat interesting that the worst case was just around the feature
> freeze, where the load on my buildfarm animal boxes was higher than normal.

If the disk is too busy, CheckPointReplicationSlots may take a very long time.
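To illustrate why a busy disk matters here: each slot's on-disk state has to be
persisted durably, which means the usual write-temp-file / fsync / rename /
fsync-the-directory sequence. The sketch below is not the actual PostgreSQL
code (save_slot_state and the paths are made-up names for illustration); it
only shows how many synchronous disk operations a single slot can cost, each
of which can stall when the disk is saturated.

/*
 * Rough sketch of the durable-save pattern assumed above.
 * Not the real implementation; names and paths are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
save_slot_state(const char *slot_dir, const void *state, size_t len)
{
	char	tmp_path[1024];
	char	final_path[1024];
	int		fd;

	snprintf(tmp_path, sizeof(tmp_path), "%s/state.tmp", slot_dir);
	snprintf(final_path, sizeof(final_path), "%s/state", slot_dir);

	/* write the new state into a temporary file first */
	fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* flush the data; this fsync can already stall for seconds on a busy disk */
	if (write(fd, state, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);

	/* atomically move the new state file into place */
	if (rename(tmp_path, final_path) != 0)
		return -1;

	/* fsync the directory so the rename itself is durable: second stall point */
	fd = open(slot_dir, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int
main(void)
{
	const char buf[] = "dummy slot state";

	/* assumes ./myslot already exists; each call pays at least two fsyncs */
	return save_slot_state("./myslot", buf, sizeof(buf));
}

If each of those fsyncs takes a second or two under heavy I/O load, one
checkpoint cycle easily stretches into the multi-second range seen in the
failing runs.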
> In comparison to earlier approaches, with the current in-tree approach, we
> don't do anything when hitting the "problem", other than wait. Which does give
> us additional information - afaics there's nothing at all indicating that some
> other backend existed allowing the replication slot drop to finish.

Did you mean "preventing"? The checkpointer and a client backend that ran
"SELECT * FROM pg_stat_activity" were the only processes running during the
blocking state.

> It just looks like for reasons I still do not understand, removing a directory
> and 2 files or so takes multiple seconds (at least ~36 new connections, 18
> pg_usleep(100_100)), while there are no other indications of problems.

That fact supports the theory that CheckPointReplicationSlots took a long time.

> I also still don't have a theory why this suddenly started to happen.

Maybe we need to look at the OS-wide disk load at that time. Couldn't the
compiler or other non-postgres tools have put significant load on the disks?

> Unless somebody has another idea, I'm planning to remove all the debugging
> code added, but keep the retry based approach in 019_replslot_limit.pl, so we
> don't again get all the spurious failures.

+1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center