On Sun, Dec 5, 2021 at 12:28 AM Peter Geoghegan <p...@bowt.ie> wrote: > I wonder why we're counting the number of dead tuples (or LP_DEAD stub > items) in the relation as a whole in ANALYZE's acquire_sample_rows() > function. Wouldn't it make more sense to focus on the "live vs dead > tuple properties" of heap pages that are not known to be all-visible > when we generate statistics for our pgstat_report_analyze() report? > These statistic collector stats are only for the benefit of autovacuum > scheduling -- and so they're *consumed* in a way that is totally > different to the nearby pg_statistic stats.
I think this could be the right idea. I'm not certain that it is, but it does sound believable. > This new approach also buys us the ability to extrapolate a new > estimated number of dead tuples using old, stale stats. The stats can > be combined with the authoritative/known number of not-all-visible > pages right this second, since it's cheap enough to *accurately* > determine the total number of not-all-visible pages for a heap > relation by calling visibilitymap_count(). My guess is that this would > be much more accurate in practice: provided the original average > number of dead/live tuples (tuples per not-all-visible block) was > still reasonably accurate, the extrapolated "total dead tuples right > now" values would also be accurate. So does this. If some of the table is now all-visible when it wasn't before, it's certainly a good guess that the portions that still aren't have about the same distribution of dead tuples that they did before ... although the other direction is less clear: it seems possible that newly not-all-visible pages have fewer dead tuples than ones which have been not-all-visible for a while. But you have to make some guess. -- Robert Haas EDB: http://www.enterprisedb.com