In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the rule for when to skip checkpoints on the grounds that not enough activity has happened since the last one. However, that commit left the comment block about it in a nonsensical state:
* If this isn't a shutdown or forced checkpoint, and we have not switched * to the next WAL file since the start of the last checkpoint, skip the * checkpoint. The idea here is to avoid inserting duplicate checkpoints * when the system is idle. That wastes log space, and more importantly it * exposes us to possible loss of both current and previous checkpoint * records if the machine crashes just as we're writing the update. * (Perhaps it'd make even more sense to checkpoint only when the previous * checkpoint record is in a different xlog page?) The new code entirely fails to prevent writing adjacent checkpoint records, because what it checks is the distance from the previous checkpoint's REDO pointer, not the previous checkpoint record itself. So the concern raised in the last two sentences of the comment isn't being addressed at all: if we corrupt the current page of WAL while trying to write the new checkpoint record, we risk losing the previous checkpoint record too. Should the system then crash, there is enough logic to back up to the second previous checkpoint record and roll forward from there --- but since we've lost the last checkpoint and up to one page's worth of preceding WAL records, there is no guarantee that we'll manage to reach a database state that is consistent with data already flushed out to disk during the last checkpoint. I started to make a quick patch to add an additional check on the location of the previous checkpoint record, so that we'd skip a new checkpoint unless we'd moved to a new page of WAL. However, if we really want to take this risk seriously, ISTM that allowing adjacent checkpoint records is bad all the time, not only for non-forced checkpoints. What I'm now thinking is that a more appropriate way to address that risk is to force a skip to a new page (not segment) of WAL after we write a checkpoint record. This won't waste much WAL space in view of the new rule to avoid checkpoints more than once per segment on average. On the other hand, you could argue that this concern is entirely hypothetical, and we're already basically assuming that once a WAL record has been flushed to disk it's safe there even if we're still writing more stuff into the same page. If we don't want to assume that, then any XLogFlush would have to include skip-to-new-page, and that's not going to be cheap. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers