On Wed, Apr 14, 2021 at 7:11 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> If we create a table with vacuum_index_cleanup = off or execute VACUUM
> with INDEX_CLEANUP = off, vacuum updates pg_stat_all_tables.n_dead_tup
> to the number of HEAPTUPLE_RECENTLY_DEAD tuples, whereas analyze
> updates it to the sum of the number of HEAPTUPLE_DEAD/RECENTLY_DEAD
> tuples and LP_DEAD line pointers. So if the table has many LP_DEAD
> line pointers due to skipping index cleanup, autovacuum is triggered
> every time after analyze/autoanalyze. This issue also seems to happen
> on back branches, probably from 12, where the INDEX_CLEANUP option was
> introduced.
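In code terms, the report above comes down to two counting rules that
disagree about LP_DEAD stubs. The following toy model is purely
illustrative -- the enum, the loop, and every name in it are invented
for the example, and only loosely mirror what lazy_scan_prune() and
heapam_scan_analyze_next_tuple() actually do:

/*
 * Toy model of the divergence described above -- NOT actual PostgreSQL
 * code. Item states and counting rules are simplified from
 * lazy_scan_prune() and heapam_scan_analyze_next_tuple().
 */
#include <stdio.h>

typedef enum
{
    ITEM_UNUSED,            /* LP_UNUSED: slot is empty */
    ITEM_LIVE,              /* LP_NORMAL, tuple is live */
    ITEM_RECENTLY_DEAD,     /* LP_NORMAL, HEAPTUPLE_RECENTLY_DEAD */
    ITEM_LP_DEAD            /* LP_DEAD stub awaiting index vacuuming */
} ItemState;

int
main(void)
{
    /* a "page" with many LP_DEAD stubs left behind by INDEX_CLEANUP = off */
    ItemState page[] = {
        ITEM_LIVE, ITEM_LP_DEAD, ITEM_LP_DEAD, ITEM_LP_DEAD,
        ITEM_RECENTLY_DEAD, ITEM_LP_DEAD, ITEM_LP_DEAD, ITEM_LP_DEAD
    };
    int nitems = sizeof(page) / sizeof(page[0]);
    int vacuum_dead = 0;    /* what VACUUM reports as n_dead_tup */
    int analyze_dead = 0;   /* what ANALYZE reports as n_dead_tup */

    for (int i = 0; i < nitems; i++)
    {
        /*
         * VACUUM's rule: LP_DEAD stubs are assumed to become LP_UNUSED
         * once index/heap vacuuming runs later in the same VACUUM, so
         * only RECENTLY_DEAD tuples are counted.
         */
        if (page[i] == ITEM_RECENTLY_DEAD)
            vacuum_dead++;

        /*
         * ANALYZE's rule: the table is reported as it is right now, so
         * LP_DEAD stubs count as dead rows too.
         */
        if (page[i] == ITEM_RECENTLY_DEAD || page[i] == ITEM_LP_DEAD)
            analyze_dead++;
    }

    /*
     * With INDEX_CLEANUP = off the stubs never go away, so the two
     * counts diverge (1 vs. 7 here) and each analyze re-arms autovacuum.
     */
    printf("vacuum: %d dead, analyze: %d dead\n", vacuum_dead, analyze_dead);
    return 0;
}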
Hmm.

> I think we can have heapam_scan_analyze_next_tuple() not count LP_DEAD
> line pointers, as lazy_scan_prune() does. Attached is the patch for that.

lazy_scan_prune() is concerned about what the state of the table *will be* when VACUUM finishes, based on its working assumption that index vacuuming and heap vacuuming always go ahead. This is exactly the same reason why lazy_scan_prune() will set LVPagePruneState.hastup to 'false' in the presence of an LP_DEAD item -- which is not how count_nondeletable_pages() decides whether the same page 'hastup' much later on, right at the end of the VACUUM (it will only consider the page safe to truncate away if it now only contains LP_UNUSED items -- LP_DEAD items make heap/table truncation unsafe).

In general, accounting rules like this may need to work slightly differently across near-identical functions because of "being versus becoming" issues. It is necessary to distinguish between "being" code (e.g., this ANALYZE code, count_nondeletable_pages() and its hastup issue) and "becoming" code (e.g., lazy_scan_prune() and its approach to counting "remaining" dead tuples, as well as hastup-ness).

I tend to doubt that your patch is the right approach, because the two code paths already "agree" once you assume that the LP_DEAD items that lazy_scan_prune() sees will be gone at the end of the VACUUM. I do agree that this is a problem, though.

Generally speaking, the "becoming" code from lazy_scan_prune() is not 100% sure that it will be correct in each case, for a large variety of reasons. But I think that we should expect it to be mostly correct. We definitely cannot allow it to be quite wrong all the time with some workloads. And so I agree that this is a problem for the INDEX_CLEANUP = off case, though it's equally an issue for the recently added failsafe mechanism.

I do not believe that it is a problem for the bypass-indexes optimization, though, because that is designed to only be applied when there are practically zero LP_DEAD items. The optimization can make VACUUM record that there are zero dead tuples after the VACUUM finishes, even though there were in fact a very small non-zero number of dead tuples -- but that's not appreciably different from any of the other ways that the count of dead tuples could be inaccurate (e.g., concurrent opportunistic pruning). The specific tests that we apply inside lazy_vacuum() should make sure that autovacuum scheduling is never affected. The autovacuum scheduling code can safely "believe" that the indexes were vacuumed, because it really is the same as if there were precisely zero LP_DEAD items (or the same for all practical purposes).

I'm not sure what to do, though. Both the INDEX_CLEANUP = off case and the failsafe case are only intended for emergencies. And it's hard to know what to do in a code path that is designed to rarely or never be used.

--
Peter Geoghegan
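To make the "being versus becoming" distinction above concrete, here is a
toy sketch of the two hastup rules. It is illustrative only -- the real
logic lives in vacuumlazy.c's lazy_scan_prune() and
count_nondeletable_pages(), and every type and function below is
invented for the example:

/*
 * Toy sketch of the "being versus becoming" hastup rules. NOT actual
 * PostgreSQL code.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum { ITEM_UNUSED, ITEM_NORMAL, ITEM_LP_DEAD } ItemState;

/*
 * "Becoming" rule (cf. lazy_scan_prune): LP_DEAD items are assumed to
 * be gone by the end of the VACUUM, so they don't make the page hastup.
 */
static bool
hastup_becoming(const ItemState *items, int n)
{
    for (int i = 0; i < n; i++)
        if (items[i] == ITEM_NORMAL)
            return true;
    return false;
}

/*
 * "Being" rule (cf. count_nondeletable_pages): the page is judged as it
 * is right now, so anything other than an unused slot -- including an
 * LP_DEAD stub -- makes truncation unsafe.
 */
static bool
hastup_being(const ItemState *items, int n)
{
    for (int i = 0; i < n; i++)
        if (items[i] != ITEM_UNUSED)
            return true;
    return false;
}

int
main(void)
{
    /* a page holding nothing but LP_DEAD stubs and empty slots */
    ItemState page[] = { ITEM_LP_DEAD, ITEM_LP_DEAD, ITEM_UNUSED };
    int n = sizeof(page) / sizeof(page[0]);

    /* prints "becoming: 0, being: 1" */
    printf("becoming: %d, being: %d\n",
           hastup_becoming(page, n), hastup_being(page, n));
    return 0;
}

A page full of LP_DEAD stubs is "empty" under the becoming rule but
still blocks truncation under the being rule, which is why the two
functions have to disagree.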