Hi,

On 2021-04-21 16:28:26 -0400, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2021-04-21 15:51:38 -0400, Stephen Frost wrote:
> > > It does seem like we have some trade-offs here to weigh, but
> > > pg_control is indeed quite small..
> >
> > What do you mean by that? That the overhead of writing it out more
> > frequently wouldn't be too bad? Or that we shouldn't "unnecessarily" add
> > more fields to it?
>
> Mostly just that the added overhead in writing it out more frequently
> wouldn't be too bad.
>
> Seems the missing bit here is "how often, and how do we make that
> happen?" and then we can discuss if there's reason to be concerned that
> it would be 'too frequent' or cause too much additional overhead in
> terms of IO/fsyncs.

The number of writes and the number of fsyncs of the control file wouldn't
necessarily have to be the same. We could e.g. update the file once per
segment, but only fsync it at a lower cadence. We already rely on handling
writes-without-fsync of the control file (which is trivial due to the
<= 512 byte limit).

Another interesting question is where we'd do the update from. It seems like
it ought to be some background process: I can see doing it in the
checkpointer - but there are a few phases that can take a while (e.g. sync)
where we currently don't call something like CheckpointWriteDelay() on a
regular basis. I can also see doing it in bgwriter - none of the work it does
should take all that long, and minor increases in latency ought not to have
much of an impact. WAL writer seems less suitable: some workloads are
sensitive to it not getting around to doing what it ought to do.

> Adding fields runs the risk of crossing the
> threshold where we feel that we can safely assume all of it will make it
> to disk in one shot and therefore there's more reason to not add extra
> fields to it, if possible.

Yea, we really should stay below 512 bytes (sector size). We're at 296 right
now, with 20 bytes lost to padding. If we got close to the limit we could
easily move some of the contents out of pg_control - we e.g. don't need to
write out all the compile time values all the time, they could live in a file
similar to PG_VERSION instead. So I'm not too concerned right now.

But we also don't need to add anything, given that we already have
minRecoveryPoint.

Greetings,

Andres Freund
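
To illustrate the idea of decoupling the number of control file writes from
the number of fsyncs, here is a minimal standalone sketch. It is not
PostgreSQL code: DemoControlData, DemoControlWriter, demo_open_control(),
demo_update_control() and the FSYNC_EVERY_N_WRITES cadence are all
hypothetical names and values chosen for the example. The only things taken
from the discussion are the single-sector (512 byte) size limit and the
"write every time, fsync less often" pattern.

```c
#define _XOPEN_SOURCE 700       /* for pwrite() and fsync() declarations */

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512         /* size assumed to reach disk in one shot */
#define FSYNC_EVERY_N_WRITES 8  /* hypothetical cadence, purely illustrative */

/* Hypothetical stand-in for the control data; NOT the real ControlFileData. */
typedef struct DemoControlData
{
    unsigned long long last_segment;    /* e.g. last WAL segment reached */
    unsigned int       checksum;        /* placeholder, not computed here */
} DemoControlData;

/* Refuse to compile if the struct no longer fits into a single sector. */
static_assert(sizeof(DemoControlData) <= SECTOR_SIZE,
              "control data must fit in one sector");

typedef struct DemoControlWriter
{
    int      fd;
    unsigned writes_since_fsync;
    unsigned total_writes;
    unsigned total_fsyncs;
} DemoControlWriter;

/* Open (or create) the demo control file; returns the fd or -1 on error. */
int
demo_open_control(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
}

/*
 * Write the whole sector-sized image at offset 0 on every call, but only
 * fsync on every FSYNC_EVERY_N_WRITES-th call. Returns 0 on success, -1 on
 * error.
 */
int
demo_update_control(DemoControlWriter *w, const DemoControlData *data)
{
    char buf[SECTOR_SIZE];

    /* Zero-pad so that exactly one full sector is written each time. */
    memset(buf, 0, sizeof(buf));
    memcpy(buf, data, sizeof(*data));

    if (pwrite(w->fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
        return -1;
    w->total_writes++;

    if (++w->writes_since_fsync >= FSYNC_EVERY_N_WRITES)
    {
        if (fsync(w->fd) != 0)
            return -1;
        w->writes_since_fsync = 0;
        w->total_fsyncs++;
    }
    return 0;
}
```

A caller would invoke demo_update_control() once per segment. Where such a
call would live (checkpointer, bgwriter, ...) is exactly the open question
above, and the sketch takes no position on it.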