On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
> I think those are two independent issues - knowing that the snapshot is
> from the last checkpoint, and knowing that it's correct (not corrupted).
> And yeah, we should be careful about fsync/durable_rename.
Yeah, that's bugging me as well. I don't really get why we would not want durability at shutdown for this data. So how about switching the end of pgstat_write_statsfile() to use durable_rename()? Sounds like an independent change, worth doing on its own.

> Yeah, I was wondering about the same thing - can't this mean we fail to
> start autovacuum? Let's say we delete a significant fraction of a huge
> table, but then we crash before the next checkpoint. With this patch we
> restore the last stats snapshot, which can easily have
>
>   n_dead_tup          | 0
>   n_mod_since_analyze | 0
>
> for the table. And thus we fail to notice the table needs autovacuum.
> AFAIK we run autovacuum on all tables with missing stats (or am I
> wrong?). That's what's happening on replicas after switchover/failover
> too, right?

That's the opposite: please look at relation_needs_vacanalyze(). If a table does not have any stats in pgstats, like on HEAD after a crash, then autoanalyze is skipped and autovacuum happens only for the anti-wraparound case. For the database case, rebuild_database_list() uses pgstat_fetch_stat_dbentry() three times and entirely discards databases that have no stats. Again, a stats entry should exist again shortly after a crash, once a connection is made to the database.

So there's an argument that the situation does not get worse here, and that we may actually improve the odds by keeping a trace of the previous stats, no? In most cases there would be a stats entry post-crash, as an INSERT or a scan would have re-created it, but the stats would only reflect the state registered since the last crash recovery (even on HEAD, a bulk delete bumping the dead tuple counter would not be detected post-crash). Spiky workloads may affect the decision-making, of course, but at least we'd allow autovacuum to take a decision based on some previous state of the stats rather than giving up entirely. That sounds like a win to me for steady workloads, less so for bursty ones.

> It'd not be such an issue if we updated stats during recovery, but I
> don't think we're doing that. Perhaps we should, which might also help
> on replicas - no idea if it's feasible, though.

Stats on replicas are considered an independent thing AFAIU (scans are counted in read-only queries, for example). If we were to do that, we may want to split stats handling between nodes doing crash recovery and nodes in standby state. Not sure if that's worth the complication; for one thing, the stats exist at node level.
--
Michael
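
To make the durable_rename() suggestion above concrete, here is a rough, untested sketch of what the tail of pgstat_write_statsfile() could look like. It assumes the function still finishes by renaming the temporary statistics file into place; the tmpfile/statfile variable names are illustrative, and durable_rename() is the helper declared in storage/fd.h:

    /*
     * Sketch: replace the plain rename() at the end of
     * pgstat_write_statsfile() with durable_rename(), so the stats file
     * and its directory entry are flushed to disk before we consider the
     * file durable at shutdown.  durable_rename() reports its own error
     * at the given elevel and returns a negative value on failure.
     */
    if (durable_rename(tmpfile, statfile, LOG) < 0)
    {
        /* error already logged by durable_rename(); discard the temp file */
        unlink(tmpfile);
    }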
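
As for the relation_needs_vacanalyze() point, the decision when a relation has no pgstats entry boils down to roughly the following (a simplified excerpt of the logic in src/backend/postmaster/autovacuum.c, with the threshold computations omitted):

    /* tabentry is the PgStat_StatTabEntry fetched for the relation */
    if (PointerIsValid(tabentry) && AutoVacuumingActive())
    {
        /*
         * Stats exist: compare the dead/inserted/modified tuple counters
         * against the autovacuum and autoanalyze thresholds.
         */
        *dovacuum = force_vacuum || (vactuples > vacthresh) ||
            (vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
        *doanalyze = (anltuples > anlthresh);
    }
    else
    {
        /*
         * No stats for this table: only an anti-wraparound vacuum
         * (force_vacuum) can trigger, and analyze is skipped entirely.
         */
        *dovacuum = force_vacuum;
        *doanalyze = false;
    }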