[HACKERS] Avoiding adjacent checkpoint records

Tom Lane Wed, 06 Jun 2012 12:09:19 -0700

In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
rule for when to skip checkpoints on the grounds that not enough
activity has happened since the last one.  However, that commit left the
comment block about it in a nonsensical state:


    * If this isn't a shutdown or forced checkpoint, and we have not switched
    * to the next WAL file since the start of the last checkpoint, skip the
    * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
    * when the system is idle. That wastes log space, and more importantly it
    * exposes us to possible loss of both current and previous checkpoint
    * records if the machine crashes just as we're writing the update.
    * (Perhaps it'd make even more sense to checkpoint only when the previous
    * checkpoint record is in a different xlog page?)

The new code entirely fails to prevent writing adjacent checkpoint
records, because what it checks is the distance from the previous
checkpoint's REDO pointer, not the previous checkpoint record itself.
So the concern raised in the last two sentences of the comment isn't
being addressed at all: if we corrupt the current page of WAL while
trying to write the new checkpoint record, we risk losing the previous
checkpoint record too.  Should the system then crash, there is enough
logic to back up to the second previous checkpoint record and roll
forward from there --- but since we've lost the last checkpoint and up
to one page's worth of preceding WAL records, there is no guarantee that
we'll manage to reach a database state that is consistent with data
already flushed out to disk during the last checkpoint.

I started to make a quick patch to add an additional check on the
location of the previous checkpoint record, so that we'd skip a new
checkpoint unless we'd moved to a new page of WAL.  However, if we
really want to take this risk seriously, ISTM that allowing adjacent
checkpoint records is bad all the time, not only for non-forced
checkpoints.

What I'm now thinking is that a more appropriate way to address that
risk is to force a skip to a new page (not segment) of WAL after we
write a checkpoint record.  This won't waste much WAL space in view
of the new rule to avoid checkpoints more than once per segment on
average.

On the other hand, you could argue that this concern is entirely
hypothetical, and we're already basically assuming that once a WAL
record has been flushed to disk it's safe there even if we're still
writing more stuff into the same page.  If we don't want to assume
that, then any XLogFlush would have to include skip-to-new-page,
and that's not going to be cheap.

Thoughts?

                        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Avoiding adjacent checkpoint records

Reply via email to