On Mon, Feb 1, 2016 at 5:48 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-02-01 17:29:39 -0700, David G. Johnston wrote:
> > Learning by reading here...
> > http://www.postgresql.org/docs/current/static/wal-internals.html
> > """
> > After a checkpoint has been made and the log flushed, the checkpoint's
> > position is saved in the file pg_control. Therefore, at the start of
> > recovery, the server first reads pg_control and then the checkpoint
> > record; then it performs the REDO operation by scanning forward from
> > the log position indicated in the checkpoint record. Because the entire
> > content of data pages is saved in the log on the first page modification
> > after a checkpoint (assuming full_page_writes is not disabled), all pages
> > changed since the checkpoint will be restored to a consistent state.
> > """
> > The above comment appears out-of-date if this post describes what
> > presently happens.
> Where do you see a conflict with what I wrote about? We store both the
> last and the previous checkpoint's location in pg_control. Or are you
> talking about:
Mainly the following...but the word I used was "out-of-date" and not
"conflict". The present state seems to do the above, and then some.
> > To deal with the case where pg_control is corrupt, we should support the
> > possibility of scanning existing log segments in reverse order — newest
> > to oldest — in order to find the latest checkpoint. This has not been
> > implemented yet. pg_control is small enough (less than one disk page) that
> > it is not subject to partial-write problems, and as of this writing there
> > have been no reports of database failures due solely to the inability to
> > read pg_control itself. So while it is theoretically a weak spot,
> > pg_control does not seem to be a problem in practice.
> if so, no, that's not out-of-date, as we simply store two checkpoint
> locations:
> $ pg_controldata /srv/dev/pgdev-dev/|grep 'checkpoint location'
> Latest checkpoint location: B3/2A730028
> Prior checkpoint location: B3/2A72FFA0
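The two stored locations Andres shows can be pulled straight out of pg_controldata's text output. As a minimal sketch (the helper name and parsing approach are mine, not part of any PostgreSQL tooling; only the field names and LSNs come from the output above):

```python
import re

# Hypothetical helper: extract the two checkpoint LSNs from pg_controldata's
# text output. The field labels match the output quoted above; everything
# else here is illustrative.
def checkpoint_locations(controldata_output):
    locs = {}
    for line in controldata_output.splitlines():
        m = re.match(r"(Latest|Prior) checkpoint location:\s*([0-9A-F/]+)", line)
        if m:
            locs[m.group(1).lower()] = m.group(2)
    return locs

sample = """\
Latest checkpoint location:           B3/2A730028
Prior checkpoint location:            B3/2A72FFA0
"""
print(checkpoint_locations(sample))
# → {'latest': 'B3/2A730028', 'prior': 'B3/2A72FFA0'}
```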
The quote implies that only a single checkpoint is noted and that no
"searching" is performed - whether by scanning or by being told the
position of a previous one so that it can jump there immediately without
scanning backwards. It isn't strictly the fact that we do not "scan"
backwards but the implications that arise in making that statement. Maybe
this is being picky but if you cannot trust the value of "Latest checkpoint
location" then pg_control is arguably corrupt. Corruption is not strictly
limited to "unable to be read" but does include "contains invalid data".
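The "contains invalid data" case is exactly what a checksum catches, and pg_control does carry a CRC for this purpose. A minimal sketch of the idea (zlib.crc32 stands in for PostgreSQL's actual CRC implementation, and the record layout is invented for illustration):

```python
import zlib

# Seal a payload with a trailing CRC, and check it on the way back in.
# This is only a sketch of checksum-based validity checking, not how
# pg_control is actually laid out.
def seal(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "little")

def is_valid(record: bytes) -> bool:
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload) == int.from_bytes(stored, "little")

record = seal(b"Latest checkpoint location: B3/2A730028")
assert is_valid(record)

# A record that reads back fine but "contains invalid data":
corrupted = bytearray(record)
corrupted[0] ^= 0xFF   # single flipped byte, solar-flare style
assert not is_valid(bytes(corrupted))
```

The point of the sketch: the file is perfectly readable in both cases, so "unable to be read" and "contains invalid data" really are distinct failure modes.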
> > Also, I was under the impression that tablespace commands resulted in
> > checkpoints so that the state of the file system could be presumed
> > current...
> That actually doesn't really make it any better - it forces the *latest*
> checkpoint, but if we can't read that, we'll start with the previous one.
> > I don't know enough internals but it seems like we'd need to distinguish
> > between an interrupted checkpoint (pull the plug during checkpoint) and
> > one that supposedly completed without interruption but then was somehow
> > corrupted (solar flares). The former seems legitimate for auto-skip while
> > the latter does not.
> I don't think such a distinction is really possible (or necessary). If
> pg_control is corrupted we won't even start, and if WAL is corrupted
> that badly we won't finish replay...
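The fallback behavior the thread is debating can be caricatured like this. None of it is PostgreSQL code; the function, readable(), and the LSN strings are stand-ins for illustration:

```python
# Hypothetical caricature of the behavior under discussion: recovery tries
# the latest checkpoint and, if it cannot be read, silently falls back to
# the prior one.
def pick_checkpoint(latest, prior, readable):
    if readable(latest):
        return latest
    # This is the step David argues should fail hard instead of proceeding:
    # an unreadable "Latest checkpoint location" arguably means pg_control
    # itself is corrupt.
    return prior if readable(prior) else None

lsns = {"B3/2A730028": False, "B3/2A72FFA0": True}   # latest unreadable
print(pick_checkpoint("B3/2A730028", "B3/2A72FFA0", lsns.get))
# → B3/2A72FFA0
```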
My takeaway from the above is that we should only record what we think is a
usable/readable/valid checkpoint location in "Latest checkpoint location"
(LCL) and if the system is not able to use that information to perform a
successful recovery it should be allowed to die without using the value in
"Previous checkpoint location" - which becomes effectively ignored during