On Wed, Oct 26, 2011 at 3:47 PM, Florian Pflug <f...@phlo.org> wrote:
> On Oct26, 2011, at 15:57 , Florian Pflug wrote:
>> As you said, the CLOG page corresponding to nextXid
>> *should* always be accessible at the start of recovery (unless the whole
>> file has been removed by VACUUM, that is). So we shouldn't need to extend
>> CLOG. Yet the error suggests that the CLOG is, in fact, too short. What I
>> said is that we shouldn't apply any fix (for the CLOG problem) before we
>> understand the reason for that apparent contradiction.
>
> Ha! I think I've got a working theory.
>
> In CreateCheckPoint(), we determine the nextXid that'll go into the
> checkpoint record, and then call CheckPointGuts(), which does the actual
> writing and fsyncing. So far, that's fine. If a transaction ID is assigned
> before we compute the checkpoint's nextXid, we'll extend the CLOG
> accordingly, and CheckPointGuts() will make sure the new CLOG page goes
> to disk.
>
> But, if wal_level = hot_standby, we also call LogStandbySnapshot() in
> CreateCheckPoint(), and we do that *after* CheckPointGuts(). Which would be
> fine too, except that LogStandbySnapshot() re-assigns the *current* value
> of ShmemVariableCache->nextXid to the checkpoint's nextXid field.
>
> Thus, if the CLOG is extended after (or in the middle of) CheckPointGuts(),
> but before LogStandbySnapshot(), then we end up with a nextXid in the
> checkpoint whose CLOG page hasn't necessarily made it to the disk yet. The
> longer CheckPointGuts() takes to finish its work, the more likely it
> becomes (assuming that CLOG writing and syncing doesn't happen at the very
> end). This fits the OP's observation of the problem vanishing when
> pg_start_backup() does an immediate checkpoint.

This is the correct explanation. I've just come back into Wifi range, so I
was just writing to you with this explanation, but your original point that
nextXid must be wrong deserves credit. OTOH I was just waiting to find out
what the reason for the physical read was, rather than guessing.

Notice that the nextXid value isn't wrong; it's just not the correct value
to use for starting clog.

As it turns out, the correct fix is actually just to skip StartupCLOG()
until the end of recovery, because it does nothing useful when executed at
that time. When I wrote the original code I remember thinking that
StartupCLOG() is superfluous at that point.

Brewing a patch now.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services