On Mon, Dec 6, 2021 at 12:07 PM Robert Haas <robertmh...@gmail.com> wrote: > So does this. If some of the table is now all-visible when it wasn't > before, it's certainly a good guess that the portions that still > aren't have about the same distribution of dead tuples that they did > before ... although the other direction is less clear: it seems > possible that newly not-all-visible pages have fewer dead tuples than > ones which have been not-all-visible for a while. But you have to make > some guess.
To me, it seems natural to accept and even embrace the inherent uncertainty about the number of dead tuples. We should model our current belief about how many dead tuples are in the table as a probability density function (or something along the same lines). There is a true "sample space" here. Once we focus on not-all-visible pages, using authoritative VM info, many kinds of misestimations are clearly impossible. For example, there are only so many not-all-visible heap pages, and they can only hold so many dead tuples (up to MaxHeapTuplesPerPage). This is a certainty. The number of dead tuples in the table is an inherently dynamic thing, which makes it totally dissimilar to the pg_statistics-based stats. And so a single snapshot of a point in time is inherently much less useful -- we ought to keep a few sets of old statistics within our new pgstat_report_analyze() -- maybe 3 or 5. Each set of statistics includes the total number of relpages at the time, the total number of not-all-visible pages (i.e. interesting pages) at the time, and the average number of live and dead tuples encountered. This is interpreted (along with a current visibilitymap_count()) to get our so-called probability density function (probably not really a PDF, probably something simpler and only vaguely similar) within autovacuum.c. It just occurred to me that it makes zero sense that pgstat_report_vacuum() does approximately the same thing as pgstat_report_analyze() -- we make no attempt to compensate for the fact that the report is made by VACUUM specifically, and so reflects the state of each page in the table immediately after it was processed by VACUUM. ISTM that this makes it much more likely to appear as an underestimate later on -- pgstat_report_vacuum() gets the furthest possible thing from a random sample. Whereas if we had more context (specifically that there are very few or even 0 all-visible pages), it wouldn't hurt us at all, and we wouldn't need to have special pgstat_report_vacuum()-only heuristics. -- Peter Geoghegan