On 11.12.2012 17:04, Fujii Masao wrote:
On Tue, Dec 11, 2012 at 6:14 PM, Heikki Linnakangas
<hlinnakan...@vmware.com>  wrote:
On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:

The cause is that CheckRecoveryConsistency() is called before rm_redo(),
as Horiguchi-san suggested upthead. Imagine the case where
minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
record. When restarting the server with that minRecoveryPoint,
the followings would happen, and then PANIC occurs.

1. XLOG_SMGR_TRUNCATE record is read.
2. CheckRecoveryConsistency() is called, and database is marked as
      consistent since we've reached minRecoveryPoint (i.e., the location
      of XLOG_SMGR_TRUNCATE).
3. XLOG_SMGR_TRUNCATE record is replayed, and invalid page is
      found.
4. Since the database has already been marked as consistent, an invalid
      page leads to PANIC.


Exactly.

In smgr_redo, EndRecPtr which is pointing the record next to
SMGR_TRUNCATE, is used as the new minRecoveryPoint. On the other
hand, during the second startup of the standby,
CheckRecoveryConsistency checks for consistency by
XLByteLE(minRecoveryPoint, EndRecPtr) which should be true at
just BEFORE the SMGR_TRUNCATE record is applied. So
reachedConsistency becomes true just before the SMGR_TRUNCATE
record will be applied. Bang!


Ah, I see. I thought moving CheckRecoveryConsistency was just a nice-to-have
thing, so that we'd notice that we're consistent earlier, so didn't pay
attention to that part.

I think the real issue here is that CheckRecoveryConsistency should not be
comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as it
currently does. EndRecPtr points to the end of the last record *read*, while
recoveryLastRecPtr points to the end of the last record *replayed*. The
latter is what CheckRecoveryConsistency should be concerned with.

BTW, it occurs to me that we have two variables in the shared struct that
are almost the same thing: recoveryLastRecPtr and replayEndRecPtr. The
former points to the end of the last record successfully replayed, while
replayEndRecPtr points to the end of the last record successfully replayed,
or begin replayed at the moment. I find that confusing, so I suggest that we
rename recoveryLastRecPtr to make that more clear. Included in the attached
patch.

The patch looks good. I confirmed that the PANIC error didn't happen,
with the patch.

Thanks. After thinking about this more, I think it's still not quite right. Now that we fix the check in CheckRecoveryConsistency, we have to move the call to CheckRecoveryConsistency to after the rm_redo call. Otherwise you don't enter hot standby mode after replaying the last record, the one ending at minRecoveryPoint. You have to wait for one more record to be read (but not replayed), so that CheckRecoveryConsistency gets called and we let backends in.

As the code stands, the bad placement of CheckRecoveryConsistency is compensated by the bug that we open up for hot standby one record too early. The net result is that we open up for hot standby just before replaying the last record that makes us consistent. The window for someone to see the inconsistent state is really small, postmaster will have to receive the signal, start accepting connections, and you have connect and run a query, all before the startup process gets around to replay the record its already read into memory.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to