Re: Bugs in TOAST handling, OID assignment and redo recovery

Heikki Linnakangas Tue, 10 Apr 2018 06:38:52 -0700

On 10/04/18 13:29, Pavan Deolasee wrote:

Hello,


One of our 2ndQuadrant support customers recently reported a sudden rush of
TOAST errors post a crash recovery, nearly causing an outage. Most errors
read like this:

ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn

While we could bring back the cluster to normal quickly using some
workarounds, I investigated this in more detail and identified two long
standing bugs in TOAST as well as redo recovery.


Great detective work!

ISTM that the right fix is to teach toast_save_datum() to check for
existence of duplicate chunk_id by scanning the table with the same
SnapshotToast that it later uses to fetch the tuples. We already do that in
case of toast rewrite, but not for regular inserts. I propose to do that
for regular path too, as done in the attached patch.

It would seem more straightforward to add a snapshot parameter toGetNewOidWithIndex(), so that the this one caller could passSnapshotToast, while others pass SnapshotDirty. With your patch, youcheck the index twice: first with SnapshotDirty, inGetNewOidWithIndex(), and then with SnapshotToast, in the caller.

If I'm reading the rewrite case correctly, it's a bit different andquite special. In the loop with GetNewOidWithIndex(), it needs to checkthat the OID is unused in two tables, the old TOAST table, and the newone. You can only pass one index to GetNewOidWithIndex(), so it needs tocheck the second index manually. It's not because of the snapshot issue.Although I wonder if we should be using SnapshotToast in thatGetNewOidWithIndex() call, too. I.e. if we should be checking both theold and the new toast table with SnapshotToast.

I think the right fix is to simply ignore the nextOid counter while
replaying ONLINE checkpoint record. We must have already initialised
ShmemVariableCache->nextOid
to the value stored in this (or some previous) checkpoint record when redo
recovery is started. As and when we see XLOG_NEXTOID record, we would
maintain the shared memory counter correctly. If we don't see any
XLOG_NEXTOID record, the value we started with must be the correct value. I
see no problem even when OID wraps around during redo recovery.
XLOG_NEXTOID should record that correctly.

Agreed. With nextXid, we advance ShmemVariableCache->nextXid if thevalue in the online checkpoint record is greater thanShmemVariableCache->nextXid. But we don't have such a wraparound-awareconcept of "greater than" for OIDs. Ignoring the online checkpointrecord's nextOid value seem fine to me.


- Heikki

Re: Bugs in TOAST handling, OID assignment and redo recovery

Reply via email to