On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <p...@bowt.ie> wrote: > 1. Cases where our inability to get a cleanup lock signifies nothing > at all about the page in question, or any page in the same table, with > the same workload. > > 2. Pathological cases. Cases where we're at least at the mercy of the > application to do something about an idle cursor, where the situation > may be entirely hopeless on a long enough timeline. (Whether or not it > actually happens in the end is less significant.)
Sure. I'm worrying about case (2). I agree that in case (1) waiting for the lock is almost always the wrong idea. > I think that you're focussing on individual VACUUM operations, whereas > I'm more concerned about the aggregate effect of a particular policy > over time. I don't think so. I think I'm worrying about the aggregate effect of a particular policy over time *in the pathological cases* i.e. (2). > This is my concern -- what I've called category 2 cases have this > exact quality. So given that, why not freeze what you can, elsewhere, > on other pages that don't have the same issue (presumably the vast > vast majority in the table)? That way you have the best possible > chance of recovering once the DBA gets a clue and fixes the issue. That's the part I'm not sure I believe. Imagine a table with a gigantic number of pages that are not yet all-visible, a small number of all-visible pages, and one page containing very old XIDs on which a cursor holds a pin. I don't think it's obvious that not waiting is best. Maybe you're going to end up vacuuming the table repeatedly and doing nothing useful. If you avoid vacuuming it repeatedly, you still have a lot of work to do once the DBA locates a clue. I think there's probably an important principle buried in here: the XID threshold that forces a vacuum had better also force waiting for pins. If it doesn't, you can tight-loop on that table without getting anything done. > That's kind of what I meant. The difference between 50 million and 150 > million is rather unclear indeed. So having accepted that that might > be true, why not be open to the possibility that it won't turn out to > be true in the long run, for any given table? With the enhancements > from the patch series in place (particularly the early freezing > stuff), what do we have to lose by making the FreezeLimit XID cutoff > for freezing much higher than your typical vacuum_freeze_min_age? > Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age > (it can't be higher than that without also making these other settings > become meaningless, of course). We should probably distinguish between the situation where (a) an adverse pin is held continuously and effectively forever and (b) adverse pins are held frequently but for short periods of time. I think it's possible to imagine a small, very hot table (or portion of a table) where very high concurrency means there are often pins. In case (a), it's not obvious that waiting will ever resolve anything, although it might prevent other problems like infinite looping. In case (b), a brief wait will do a lot of good. But maybe that doesn't even matter. I think part of your argument is that if we fail to update relfrozenxid for a while, that really isn't that bad. I think I agree, up to a point. One consequence of failing to immediately advance relfrozenxid might be that pg_clog and friends are bigger, but that's pretty minor. Another consequence might be that we might vacuum the table more times, which is more serious. I'm not really sure that can happen to a degree that is meaningful, apart from the infinite loop case already described, but I'm also not entirely sure that it can't. -- Robert Haas EDB: http://www.enterprisedb.com