On Mon, Feb 1, 2016 at 4:58 PM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster that'd otherwise be easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.
>
> As there's, afaics, really no "legitimate" reason for needing to go
> back to the previous checkpoint, I don't think we should do so in an
> automated fashion.
>
> All the cases where I could find logs containing "using previous
> checkpoint record at" were ones where something else had already gone
> pretty badly wrong. Now that obviously doesn't have very large
> significance, because the situations where it "just worked" are
> unlikely to be reported...
>
> Am I missing a reason for doing this by default?

Learning by reading here...

http://www.postgresql.org/docs/current/static/wal-internals.html

"""
After a checkpoint has been made and the log flushed, the checkpoint's
position is saved in the file pg_control. Therefore, at the start of
recovery, the server first reads pg_control and then the checkpoint
record; then it performs the REDO operation by scanning forward from the
log position indicated in the checkpoint record. Because the entire
content of data pages is saved in the log on the first page modification
after a checkpoint (assuming full_page_writes is not disabled), all pages
changed since the checkpoint will be restored to a consistent state.

To deal with the case where pg_control is corrupt, we should support the
possibility of scanning existing log segments in reverse order — newest
to oldest — in order to find the latest checkpoint. This has not been
implemented yet.
pg_control is small enough (less than one disk page) that it is not
subject to partial-write problems, and as of this writing there have been
no reports of database failures due solely to the inability to read
pg_control itself. So while it is theoretically a weak spot, pg_control
does not seem to be a problem in practice.
"""

The above comment appears out-of-date if this post describes what
presently happens. Also, I was under the impression that tablespace
commands resulted in checkpoints, so that the state of the file system
could be presumed current...

I don't know enough internals, but it seems like we'd need to distinguish
between an interrupted checkpoint (pull the plug during checkpoint) and
one that supposedly completed without interruption but was then somehow
corrupted (solar flares). The former seems legitimate for auto-skip while
the latter does not.

David J.
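For readers following along, the behavior being debated can be sketched roughly as follows. This is purely illustrative pseudologic, not PostgreSQL source: the names (start_recovery, read_checkpoint_record, CorruptCheckpointError) and the auto_fallback flag are hypothetical, standing in for the recovery startup path that today silently retries the previous checkpoint and that Andres proposes should instead fail and demand operator intervention.

```python
# Illustrative sketch only -- not actual PostgreSQL code. All names here
# are hypothetical stand-ins for the recovery startup logic.

class CorruptCheckpointError(Exception):
    """Raised when a checkpoint record cannot be read or validated."""

def read_checkpoint_record(location):
    """Stand-in for reading a checkpoint record from WAL. In this toy
    model, a negative location represents an unreadable/corrupt record."""
    if location < 0:
        raise CorruptCheckpointError(location)
    return {"redo": location}

def start_recovery(latest, previous, auto_fallback):
    """Current behavior (auto_fallback=True): silently fall back to the
    previous checkpoint. Proposed behavior (auto_fallback=False): refuse
    and surface the error, so a human decides what to do next."""
    try:
        return read_checkpoint_record(latest)
    except CorruptCheckpointError:
        if auto_fallback:
            # Risky, per the thread: e.g. a tablespace dropped after the
            # previous checkpoint can make replay from there impossible,
            # since the directory to create files in no longer exists.
            return read_checkpoint_record(previous)
        raise
```

The point of the proposed change is the final `raise`: replay from a stale checkpoint can silently wreck an otherwise recoverable cluster, whereas stopping forces the operator to inspect the situation first.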