On 27.10.2011 02:29, Florian Pflug wrote:
Per my theory about the cause of the problem in my other mail, I think you
might see StartupCLOG failures even during crash recovery, provided that
wal_level was set to hot_standby when the primary crashed. Here's how

1) We start a checkpoint, and get as far as LogStandbySnapshot()
2) A backend does AssignTransactionId, and gets as far as GetTransactionoId().
   The assigned XID requires CLOG extension.
3) The checkpoint continues, and LogStandbySnapshot () advances the
   checkpoint's nextXid to the XID assigned in (2).
4) We crash after writing the checkpoint record, but before the CLOG
   extension makes it to the disk, and before any trace of the XID assigned
   in (2) makes it to the xlog.

Then StartupCLOG() would fail at the end of recovery, because we'd end up
with a nextXid whose corresponding CLOG page doesn't exist.

No, clog extension is WAL-logged while holding the XidGenLock. At step 3, LogStandbySnapshot() would block until the clog-extension record is written to WAL, so crash recovery would see and replay that record before calling StartupCLOG().

That can happen during hot standby, though, because StartupCLOG() is called earlier.

My suggestion is to fix the CLOG problem in that same way that you fixed
the SUBTRANS problem, i.e. by moving LogStandbySnapshot() to before

Here's what I image CreateCheckPoint() should look like:

1) LogStandbySnapshot() and fill out oldestActiveXid
2) Fill out REDO
3) Wait for concurrent commits
4) Fill out nextXid and the other fields
5) CheckPointGuts()
6) Rest

It's then no longer necessary for LogStandbySnapshot() do modify
the nextXid, since we fill out nextXid after LogStandbySnapshot() and
will thus derive a higher value than LogStandbySnapshot() would have.

Hmm, I don't think that fully fixes the problem. Even if you're certain that CheckPointGuts() has fsync'd the clog page to disk, VACUUM might decide to truncate it away again while the checkpoint is running.

  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to