On Oct26, 2011, at 18:08 , Simon Riggs wrote: > On Wed, Oct 26, 2011 at 3:47 PM, Florian Pflug <f...@phlo.org> wrote: >> On Oct26, 2011, at 15:57 , Florian Pflug wrote: >> Thus, if the CLOG is extended after (or in the middle of) CheckPointGuts(), >> but >> before LogStandbySnapshot(), then we end up with a nextXid in the checkpoint >> whose CLOG page hasn't necessarily made it to the disk yet. The longer >> CheckPointGuts() >> takes to finish it's work the more likely it becomes (assuming that CLOG >> writing >> and syncing doesn't happen at the very end). This fits the OP's observation >> ob the >> problem vanishing when pg_start_backup() does an immediate checkpoint. > > As it turns out the correct fix is actually just to skip StartupClog() > until the end of recovery because it does nothing useful when executed > at that time. When I wrote the original code I remember thinking that > StartupClog() is superfluous at that point.
Hm, that fixes the problem in the hot standby case, but as I said in my reply to Chris Redekop, normal crash recovery is also at risk (although the probability of hitting the bug is much smaller there). Here's my reasoning from that other mail: Per my theory about the cause of the problem in my other mail, I think you might see StartupCLOG failures even during crash recovery, provided that wal_level was set to hot_standby when the primary crashed. Here's how 1) We start a checkpoint, and get as far as LogStandbySnapshot() 2) A backend does AssignTransactionId, and gets as far as GetTransactionoId(). The assigned XID requires CLOG extension. 3) The checkpoint continues, and LogStandbySnapshot () advances the checkpoint's nextXid to the XID assigned in (2). 4) We crash after writing the checkpoint record, but before the CLOG extension makes it to the disk, and before any trace of the XID assigned in (2) makes it to the xlog. Then StartupCLOG() would fail at the end of recovery, because we'd end up with a nextXid whose corresponding CLOG page doesn't exist. Quite aside from that concern, I think it's probably not a good idea for the nextXid value of a checkpoint to depend on whether wal_level was set to hot_standby or not. Our recovery code is already quite complex and hard to test, and this just adds one more combination that has to be thought about while coding and that needs to be tested. My suggestion is to fix the CLOG problem in that same way that you fixed the SUBTRANS problem, i.e. by moving LogStandbySnapshot() to before CheckPointGuts(). Here's what I image CreateCheckPoint() should look like: 1) LogStandbySnapshot() and fill out oldestActiveXid 2) Fill out REDO 3) Wait for concurrent commits 4) Fill out nextXid and the other fields 5) CheckPointGuts() 6) Rest It's then no longer necessary for LogStandbySnapshot() do modify the nextXid, since we fill out nextXid after LogStandbySnapshot() and will thus derive a higher value than LogStandbySnapshot() would have. We could then also fold GetOldestActiveTransactionId() back into your proposed LogStandbySnapshot() and thus don't need two ProcArray traversals. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers