Hi, On Mon, Nov 24, 2025 at 11:07:41AM +0100, Álvaro Herrera wrote: > On 2025-Nov-24, Michael Banck wrote: > > > In general I doubt how much those gauges (as oppposed to counters) only > > pertaining to the last checkpoint are useful in pg_stat_checkpointer. > > What would be the use case for those two values? > > I think it's useful to know how long checkpoint has to work. It's a bit > lame to have only one duration (the last one), but at least with this > arrangement you can have external monitoring software connect to the > server, extract that value and save it somewhere else. Monitoring > systems do this all the time, and we've been waiting for a better > implementation to store monitoring data inside Postgres for years. I > think we shouldn't block this proposal just because of this issue, > because it can clearly be useful.
I don't know - what happens if the monitoring systems reads those values every minute, but then suddenly Postgres checkpoints every 20 seconds due to a traffic spike? It would just not see those additional checkpoints in this case, no? What monitoring systems do (have to do) is query write_time + sync_time as total_time in pg_stat_checkpointer and store that along with the timestamp of the query. Then you (maybe awkwardly) generate a graph of the checkpoint durations over time. > However, I'm not sure I'm very interested in knowing only the duration > of the checkpoint. I mean, much of the time the duration is going to be > whatever fraction of the checkpoint timeout you have as > checkpoint_completion_target, right? Which includes sleeps. Yeah, that is the other thing I was wondering about, but did not mention in my mail, good point. > So I think you really want two durations: one is the duration itself, > and the other is what fraction of that did the checkpointer sleep in > order to achieve that duration. So you know how much time > checkpointer spent trying to get the operating system do stuff rather > than just sit there waiting. We already have that data, kinda, in > write_time and sync_time, but those are cumulative rather than just > for the last one. I think that we either have "last timestamp whatever" or "total", but I think we don't have "last duration" anywhere? > (I guess you can have the monitoring system compute > the deltas as it finds each new checkpoint.) I'm not sure how good > this system is. Right, this is what I meant above. But from what I see on PG18, total_time just seems tbe write_time + sync_time, do we have the sleep somewhere? > In the past, I looked at a couple of monitoring dashboards offered by > cloud vendors, searching for anything valuable in terms of checkpoints. > What I saw was very disappointing -- mostly just "how many checkpoints > per minute", which is mostly flat zero with periodic spikes. Totally > useless. Does anybody know if some vendor has good charts for this? > Also, if we were to add this new proposed duration, how could these > charts improve? I don't have a good answer here. Michael
