I had suggested something more than just cost-limit throttling, namely a restartable vacuum - https://www.postgresql.org/message-id/CAPdcCKpvZiRCoDxQoo9mXxXAK8w=bx5nqdttgzvhv2suxp0...@mail.gmail.com

With today's cloud infrastructure and monitoring, it may not be difficult to predict patterns of idle periods. If we keep a manual vacuum going during those idle periods, there is much less chance of autovacuum becoming disruptive. This could be built as an extension or inside the engine. However, that change is a bit bigger than just a config parameter, and it didn't get much traction.
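To make the idea concrete, here is a minimal standalone sketch in C. This is not PostgreSQL source and every name in it is hypothetical; a real implementation would persist the restart position and derive the window from monitoring data rather than hard-coded hours:

/* idle_vacuum.c - hypothetical sketch of the idle-window idea above;
 * not PostgreSQL code. Build: cc idle_vacuum.c -o idle_vacuum */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct { int start_hour, end_hour; } IdleWindow;

/* True if the current local hour lies in [start_hour, end_hour). */
static bool in_idle_window(IdleWindow w)
{
    time_t now = time(NULL);
    struct tm tm_now;
    localtime_r(&now, &tm_now);
    return tm_now.tm_hour >= w.start_hour && tm_now.tm_hour < w.end_hour;
}

/* Stand-in for a restartable vacuum: one chunk of work at a time,
 * remembering how far we got so the next window resumes from there. */
static int next_block = 0;              /* would be persisted in reality */
static const int total_blocks = 1000;   /* pretend table size */

static bool vacuum_one_chunk(void)
{
    if (next_block >= total_blocks)
        return false;                   /* table fully vacuumed */
    next_block += 10;                   /* process the next 10 blocks */
    return true;
}

int main(void)
{
    IdleWindow w = { .start_hour = 1, .end_hour = 5 };  /* 01:00-05:00 */

    /* Work only while inside the window; stopping mid-table is fine
     * because the saved position makes the vacuum restartable. */
    while (in_idle_window(w) && vacuum_one_chunk())
        ;

    printf("stopped at block %d of %d\n", next_block, total_blocks);
    return 0;
}

The essential piece is the saved position: without restartability, a vacuum that cannot finish inside one idle window loses all its progress, which is exactly why the idea is bigger than a config parameter.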
- Jay Sudrik

On Tue, Jun 18, 2024 at 1:09 AM Robert Haas <robertmh...@gmail.com> wrote:
> Hi,
>
> As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest
> problem with autovacuum as it exists today is that the cost delay is
> sometimes too low to keep up with the amount of vacuuming that needs
> to be done. I sketched a solution during the talk, but it was very
> complicated, so I started to try to think of simpler ideas that might
> still solve the problem, or at least be better than what we have
> today.
>
> I think we might be able to get fairly far by observing that if the
> number of running autovacuum workers is equal to the maximum allowable
> number of running autovacuum workers, that may be a sign of trouble,
> and the longer that situation persists, the more likely it is that
> we're in trouble. So, a very simple algorithm would be: If the maximum
> number of workers have been running continuously for more than, say,
> 10 minutes, assume we're falling behind and exempt all workers from
> the cost limit for as long as the situation persists. One could
> criticize this approach on the grounds that it causes a very sudden
> behavior change instead of, say, allowing the rate of vacuuming to
> gradually increase. I'm curious to know whether other people think
> that would be a problem.
>
> I think it might be OK, for a couple of reasons:
>
> 1. I'm unconvinced that the vacuum_cost_delay system actually prevents
> very many problems. I've fixed a lot of problems by telling users to
> raise the cost limit, but virtually never by lowering it. When we
> lowered the delay by an order of magnitude a few releases ago -
> equivalent to increasing the cost limit by an order of magnitude - I
> didn't personally hear any complaints about that causing problems. So
> disabling the delay completely some of the time might just be fine.
>
> 1a. Incidentally, when I have seen problems because of vacuum running
> "too fast", it's not been because it was using up too much I/O
> bandwidth, but because it's pushed too much data out of cache too
> quickly. A long overnight vacuum can evict a lot of pages from the
> system page cache by morning - the ring buffer only protects our
> shared_buffers, not the OS cache. I don't think this can be fixed by
> rate-limiting vacuum, though: to keep the cache eviction at a level
> low enough that you could be certain of not causing trouble, you'd
> have to limit it to an extremely low rate which would just cause
> vacuuming not to keep up. The cure would be worse than the disease at
> that point.
>
> 2. If we decided to gradually increase the rate of vacuuming instead
> of just removing the throttling all at once, what formula would we use
> and why would that be the right idea? We'd need a lot of state to
> really do a calculation of how fast we would need to go in order to
> keep up, and that starts to rapidly turn into a very complicated
> project along the lines of what I mooted in Vancouver. Absent that,
> the only other idea I have is to gradually ramp up the cost limit
> higher and higher, which we could do, but we would have no idea how
> fast to ramp it up, so anything we do here feels like it's just
> picking random numbers and calling them an algorithm.
>
> If you like this idea, I'd like to know that, and hear any further
> thoughts you have about how to improve or refine it. If you don't, I'd
> like to know that, too, and any alternatives you can propose,
> especially alternatives that don't require crazy amounts of new
> infrastructure to implement.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
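For what it's worth, the saturation trigger described in the quoted mail is simple enough to sketch in a few lines. The following is a hypothetical standalone illustration in C, not actual PostgreSQL code; MAX_WORKERS stands in for autovacuum_max_workers and SATURATION_SECS for the suggested 10-minute grace period:

/* saturation_trigger.c - hypothetical sketch of the quoted proposal;
 * not PostgreSQL source. Build: cc saturation_trigger.c */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MAX_WORKERS     3        /* stand-in for autovacuum_max_workers */
#define SATURATION_SECS (10 * 60)

static time_t saturated_since = 0;   /* 0 = not currently saturated */

/* Called periodically (say, from a launcher-style loop) with the
 * number of workers currently running. Returns true once every worker
 * slot has been busy continuously for SATURATION_SECS, meaning workers
 * should ignore the cost limit; reverts as soon as a slot frees up. */
static bool cost_limit_exempt(int running_workers, time_t now)
{
    if (running_workers < MAX_WORKERS)
    {
        saturated_since = 0;          /* not saturated: reset the clock */
        return false;
    }
    if (saturated_since == 0)
        saturated_since = now;        /* saturation just began */
    return (now - saturated_since) >= SATURATION_SECS;
}

int main(void)
{
    time_t t0 = time(NULL);
    printf("%d\n", cost_limit_exempt(MAX_WORKERS, t0));            /* 0 */
    printf("%d\n", cost_limit_exempt(MAX_WORKERS, t0 + 11 * 60));  /* 1 */
    printf("%d\n", cost_limit_exempt(2, t0 + 12 * 60));            /* 0 */
    return 0;
}

The boolean flip here is exactly the sudden behavior change the mail asks about; a gradual ramp would replace it with a multiplier on the cost limit, but as the mail notes, any particular ramp rate would be an arbitrary choice.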