On 10/04/18 13:29, Pavan Deolasee wrote:
Hello,

One of our 2ndQuadrant support customers recently reported a sudden rush of
TOAST errors post a crash recovery, nearly causing an outage. Most errors
read like this:

ERROR: unexpected chunk number 0 (expected 1) for toast value nnnn

While we could bring back the cluster to normal quickly using some
workarounds, I investigated this in more detail and identified two long
standing bugs in TOAST as well as redo recovery.

Great detective work!

ISTM that the right fix is to teach toast_save_datum() to check for
existence of duplicate chunk_id by scanning the table with the same
SnapshotToast that it later uses to fetch the tuples. We already do that in
case of toast rewrite, but not for regular inserts. I propose to do that
for regular path too, as done in the attached patch.

It would seem more straightforward to add a snapshot parameter to GetNewOidWithIndex(), so that the this one caller could pass SnapshotToast, while others pass SnapshotDirty. With your patch, you check the index twice: first with SnapshotDirty, in GetNewOidWithIndex(), and then with SnapshotToast, in the caller.

If I'm reading the rewrite case correctly, it's a bit different and quite special. In the loop with GetNewOidWithIndex(), it needs to check that the OID is unused in two tables, the old TOAST table, and the new one. You can only pass one index to GetNewOidWithIndex(), so it needs to check the second index manually. It's not because of the snapshot issue. Although I wonder if we should be using SnapshotToast in that GetNewOidWithIndex() call, too. I.e. if we should be checking both the old and the new toast table with SnapshotToast.

I think the right fix is to simply ignore the nextOid counter while
replaying ONLINE checkpoint record. We must have already initialised
ShmemVariableCache->nextOid
to the value stored in this (or some previous) checkpoint record when redo
recovery is started. As and when we see XLOG_NEXTOID record, we would
maintain the shared memory counter correctly. If we don't see any
XLOG_NEXTOID record, the value we started with must be the correct value. I
see no problem even when OID wraps around during redo recovery.
XLOG_NEXTOID should record that correctly.

Agreed. With nextXid, we advance ShmemVariableCache->nextXid if the value in the online checkpoint record is greater than ShmemVariableCache->nextXid. But we don't have such a wraparound-aware concept of "greater than" for OIDs. Ignoring the online checkpoint record's nextOid value seem fine to me.

- Heikki

Reply via email to