Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

Florian Pflug Wed, 26 Oct 2011 09:01:11 -0700

On Oct26, 2011, at 17:36 , Chris Redekop wrote:
> > And I think they also reported that if they didn't run hot standby,
> > but just normal recovery into a new master, it didn't have the problem
> > either, i.e. without hotstandby, recovery ran, properly extended the
> > clog, and then ran as a new master fine.
> 
> Yes this is correct...attempting to start as hotstandby will produce the
> pg_clog error repeatedly and then without changing anything else, just
> turning hot standby off it will start up successfully.


Yup, because with hot standby disabled (on the client side), StartupCLOG()
happens after recovery has completed. That, at the very least, makes the
problem very unlikely to occur in the non-hot-standby case. I'm not sure
it's completely impossible, though.

Per my theory about the cause of the problem in my other mail, I think you
might see StartupCLOG failures even during crash recovery, provided that
wal_level was set to hot_standby when the primary crashed. Here's how:

1) We start a checkpoint, and get as far as LogStandbySnapshot()
2) A backend does AssignTransactionId(), and gets as far as GetNewTransactionId().
   The assigned XID requires CLOG extension.
3) The checkpoint continues, and LogStandbySnapshot() advances the
   checkpoint's nextXid to the XID assigned in (2).
4) We crash after writing the checkpoint record, but before the CLOG
   extension makes it to the disk, and before any trace of the XID assigned
   in (2) makes it to the xlog.

Then StartupCLOG() would fail at the end of recovery, because we'd end up
with a nextXid whose corresponding CLOG page doesn't exist.
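
The four steps above can be sketched as a toy model. This is only an
illustration, not PostgreSQL code: it assumes the default 8 kB block size
and 2 status bits per transaction (so 32768 XIDs per CLOG page), ignores
XID wraparound and CLOG segment files, and every function name in it is
made up for this example.

```python
BLCKSZ = 8192                      # assumed default block size
CLOG_XACTS_PER_PAGE = BLCKSZ * 4   # 2 bits per xact -> 4 xacts per byte


def clog_page(xid):
    """CLOG page that holds the status bits of xid."""
    return xid // CLOG_XACTS_PER_PAGE


def needs_extension(xid):
    """True if xid is the first XID on its page, so assigning it extends CLOG."""
    return xid % CLOG_XACTS_PER_PAGE == 0


def startup_clog_ok(pages_on_disk, next_xid):
    """Hypothetical stand-in for the check that nextXid's page exists."""
    return clog_page(next_xid) in pages_on_disk


def crash_scenario(assigned_xid, extension_flushed):
    """Model steps 1-4: returns True if startup after the crash would succeed."""
    # Step 1: checkpoint starts; all pages up to the previous XID are on disk.
    pages_on_disk = set(range(clog_page(assigned_xid - 1) + 1))
    # Step 2: a backend is assigned assigned_xid; the new page exists only
    # in memory unless it happens to be flushed before the crash.
    if needs_extension(assigned_xid) and extension_flushed:
        pages_on_disk.add(clog_page(assigned_xid))
    # Step 3: the checkpoint record ends up carrying the advanced nextXid.
    checkpoint_next_xid = assigned_xid
    # Step 4: crash, then recovery looks up the page of nextXid.
    return startup_clog_ok(pages_on_disk, checkpoint_next_xid)
```

With these assumptions, crash_scenario(32768, False) reports a failure,
while the same XID with the extension flushed, or an XID that stays on
the existing page, starts up fine.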

> > This fits the OP's observation of the
> > problem vanishing when pg_start_backup() does an immediate checkpoint.
> 
> Note that this is *not* the behaviour I'm seeing....it's possible it happens
> more frequently without the immediate checkpoint, but I am seeing it happen
> even with the immediate checkpoint.

Yeah, I should have said "of the problem's likelihood decreasing" instead
of "vanishing". The point is, the longer the checkpoint takes, the higher
the chance the nextXid is advanced far enough to require a CLOG extension.

That alone isn't enough to trigger the error - the CLOG extension must also
*not* make it to the disk before the checkpoint completes - but it's
a required precondition for the error to occur.
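
To put a rough number on that precondition, here is a back-of-the-envelope
sketch. It again assumes 32768 XIDs per CLOG page (8 kB blocks, 2 bits per
transaction); the XID consumption rate and checkpoint duration are
hypothetical inputs, and the function names are invented for illustration.

```python
CLOG_XACTS_PER_PAGE = 8192 * 4   # assumed: 8 kB blocks, 2 bits per xact


def xids_before_extension(next_xid):
    """Number of XIDs, starting at next_xid, that fit on the current page."""
    return -next_xid % CLOG_XACTS_PER_PAGE


def checkpoint_crosses_page(next_xid, xids_per_second, checkpoint_seconds):
    """True if enough XIDs are assigned during the checkpoint to extend CLOG."""
    assigned = int(xids_per_second * checkpoint_seconds)
    return assigned > xids_before_extension(next_xid)
```

For example, with nextXid at 32760 only 8 XIDs remain on the page, so at
100 XIDs per second a one-second checkpoint crosses the boundary while a
50 ms one does not. This only models the first precondition; whether the
new page reaches disk in time is a separate matter.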

best regards,
Florian Pflug


