On Wed, Apr 5, 2023 at 4:59 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I think that I agree. I think that the difficulty of tuning autovacuum
> is the actual real problem. (Or maybe it's just very closely related
> to the real problem -- the precise definition doesn't seem important.)
I agree, and I think that bad choices around what the parameters do are a big part of the problem. autovacuum_max_workers is one example of that, but there are a bunch of others. It's not at all intuitive that if your database gets really big you either need to raise autovacuum_vacuum_cost_limit or lower autovacuum_vacuum_cost_delay. And it's not intuitive either that raising autovacuum_max_workers doesn't increase the amount of vacuuming that gets done. In my experience, it's very common for people to observe that autovacuum is running constantly, and to figure out that the number of running workers is equal to autovacuum_max_workers at all times, and to then conclude that they need more workers. So they raise autovacuum_max_workers and nothing gets any better. In fact, things might get *worse*, because the time required to complete vacuuming of a large table can increase when the available bandwidth is spread across more workers, and it's very often the time to vacuum the largest tables that determines whether things hold together adequately or not.

This kind of stuff drives me absolutely batty. It's impossible to make every database behavior completely intuitive, but here we have a parameter that seems like it is exactly the right thing to solve the problem that the user knows they have, and it actually does nothing on a good day and causes a regression on a bad one. That's incredibly poor design.

The way it works at the implementation level is pretty kooky, too. The available resources are split between the workers, but if any of the relevant vacuum parameters are set for the table currently being vacuumed, then that worker gets the full resources configured for that table, and everyone else divides up the amount that's configured globally. So if you went and set the cost delay and cost limit for all of your tables to exactly the same values that are configured globally, you'd vacuum 3 times faster than if you relied on the identical global defaults (or N times faster, where N is the value you've picked for autovacuum_max_workers). If you have one really big table that requires continuous vacuuming, you could slow down vacuuming on that table through manual configuration settings and still end up speeding up vacuuming overall, because the remaining workers would be dividing the budget implied by the default settings among N-1 workers instead of N workers. As far as I can see, none of this is documented, which is perhaps for the best, because IMV it makes no sense. (There's a concrete sketch of this arithmetic at the end of this message.)

I think we need to move more toward a model where VACUUM just keeps up. Emergency mode is a step in that direction, because the definition of an emergency is that we're definitely not keeping up, but I think we need something less Boolean. If the database gets bigger or smaller or more or less active, autovacuum should somehow just adjust to that, without so much manual fiddling. I think it's good to have the possibility of some manual fiddling to handle problematic situations, but you shouldn't have to do it just because you made a table bigger.

--
Robert Haas
EDB: http://www.enterprisedb.com
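
To put rough numbers on the budget-splitting behavior described above, here's a sketch, assuming the stock defaults (vacuum_cost_limit = 200, autovacuum_vacuum_cost_delay = 2ms, autovacuum_max_workers = 3); the table name is made up:

    -- Three workers relying on the global settings share one 200-point budget,
    -- so each effectively gets roughly 67 cost units per delay cycle. A worker
    -- vacuuming a table that has its own storage parameters is exempt from the
    -- balancing and keeps the full configured budget to itself:

    ALTER TABLE some_big_table SET (
        autovacuum_vacuum_cost_limit = 200,  -- numerically identical to the global default
        autovacuum_vacuum_cost_delay = 2     -- likewise (milliseconds)
    );

    -- With settings that merely restate the defaults, this table now gets vacuumed
    -- roughly 3x faster, while the remaining workers keep splitting the global
    -- budget among themselves.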