On 7 May 2013 10:57, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> While testing the bug from the "Assertion failure at standby promotion", I
> bumped into a different bug in fast promotion. When the first checkpoint
> after fast promotion is performed, there is no guarantee that the
> checkpointer process is running with the correct, new, ThisTimeLineID. In
> CreateCheckPoint(), we have this:
>
>>         /*
>>          * An end-of-recovery checkpoint is created before anyone is allowed to
>>          * write WAL. To allow us to write the checkpoint record, temporarily
>>          * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
>>          * initialized, which we need here and in AdvanceXLInsertBuffer.)
>>          */
>>         if (flags & CHECKPOINT_END_OF_RECOVERY)
>>                 LocalSetXLogInsertAllowed();
>
> That ensures that ThisTimeLineID is updated when performing an
> end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
> The consequence is that the checkpoint is created with the old timeline, and
> subsequent recovery from it will fail.

LocalSetXLogInsertAllowed() is called by CreateEndOfRecoveryRecord(), but
in fast promotion that is called by the Startup process, not the
Checkpointer process. So there is no longer an explicit call to
InitXLOGAccess() in the checkpointer process via a checkpoint with the
CHECKPOINT_END_OF_RECOVERY flag.

However, there is a call to RecoveryInProgress() at the top of the main
loop of the checkpointer, which does explicitly state that it "initializes
TimeLineID if it's not set yet." The checkpointer decides whether to run a
restartpoint or a checkpoint directly from the answer to that, modified
only in the case of a CHECKPOINT_END_OF_RECOVERY. So the appropriate call
is made and I don't agree with the statement that this "doesn't get
executed with fast promotion", or the title of the thread.

So while I believe that the checkpointer might have an incorrect TLI and
that you've seen a bug, what isn't clear is whether the checkpointer is the
only process that would see an incorrect TLI, or why such processes see an
incorrect TLI. It seems equally likely at this point that the TLI may be
set incorrectly somehow and that is why it is being read incorrectly.

I see that the comment in InitXLOGAccess() is incorrect: "ThisTimeLineID
doesn't change so we need no lock to copy it". That seems worth correcting,
since there's no need to save time in a once-only function.

Continuing to investigate further.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
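
The lazy-initialization argument in the reply above can be modeled with a
small self-contained sketch. This is not PostgreSQL source: SharedState,
ProcessState, recovery_in_progress(), init_xlog_access() and
checkpointer_tli() are hypothetical stand-ins, and the sketch assumes that
the shared timeline value is updated before the shared recovery flag is
cleared. Under that assumption, the first check made after promotion copies
the new timeline into the process-local variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical stand-in for state kept in shared memory. */
struct SharedState {
    bool       recovery_in_progress;
    TimeLineID this_timeline_id;
};

/* Hypothetical stand-in for one process's local variables. */
struct ProcessState {
    bool       local_recovery_in_progress;
    TimeLineID this_timeline_id;        /* 0 means "not set yet" */
};

/* Copy the shared timeline into the process-local variable. */
void init_xlog_access(struct ProcessState *proc,
                      const struct SharedState *shared)
{
    assert(shared->this_timeline_id != 0);
    proc->this_timeline_id = shared->this_timeline_id;
}

/*
 * Return true while recovery is still running. The first time the shared
 * flag is seen as false, initialize the local timeline as a side effect.
 * Once the local flag is false, shared state is never consulted again.
 */
bool recovery_in_progress(struct ProcessState *proc,
                          const struct SharedState *shared)
{
    if (!proc->local_recovery_in_progress)
        return false;

    proc->local_recovery_in_progress = shared->recovery_in_progress;
    if (!proc->local_recovery_in_progress)
        init_xlog_access(proc, shared);

    return proc->local_recovery_in_progress;
}

/*
 * Simulate two iterations of a checkpointer-like main loop, optionally
 * with a promotion from timeline 1 to timeline 2 in between, and return
 * the timeline the process ends up with.
 */
TimeLineID checkpointer_tli(bool promote)
{
    struct SharedState shared = {
        .recovery_in_progress = true, .this_timeline_id = 1
    };
    struct ProcessState checkpointer = {
        .local_recovery_in_progress = true, .this_timeline_id = 0
    };

    /* First iteration: still in recovery, so nothing is initialized. */
    (void) recovery_in_progress(&checkpointer, &shared);

    if (promote) {
        /* Assumed ordering: timeline is published before the flag flips. */
        shared.this_timeline_id = 2;
        shared.recovery_in_progress = false;
    }

    /* Second iteration: picks up the new timeline only if promoted. */
    (void) recovery_in_progress(&checkpointer, &shared);
    return checkpointer.this_timeline_id;
}
```

Calling checkpointer_tli(true) yields timeline 2, while
checkpointer_tli(false) leaves the local value unset. If the assumed
ordering did not hold, the one-time copy would capture whatever value was
in shared state at that moment, which is one way the "set incorrectly"
possibility raised above could arise in this toy model.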