We just discussed this in detail with Simon, and it looks like we have
5 (!) different but related problems:
1) The original problem of freeze then crash, leaving too high values in
relminxid and datminxid. If you then run vacuum, it might truncate CLOG
and you lose the commit status of the records that were supposed to be
To fix this, we need to WAL log freezing as already discussed.
2) vactuple_get_minxid doesn't take into account xmax's of tuples that
have HEAP_XMAX_INVALID set. That's a problem:
transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
transaction 1001 - ROLLBACK;
transaction 1002 - VACUUM foo;
VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set
on the tuple (possibly by vacuum itself) that the deletion that rolled
back touched. However, that hint-bit update hasn't hit the disk yet, so
after recovery, the tuple will have an xmax of 1001 with no hint-bit,
and relminxid is 1002.
The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint
bit, and take any xmax other than InvalidXid into account when
calculating the relminxid.
3) If you recover from a PITR backup (or have a hot stand-by), with base
backup that's more than 4 billion transactions older than the newest WAL
record, the clog entries of old transactions in the base backup will
overlap with the clog entries of new transactions that are in the WAL
records. This is the problem you also pointed out below.
To fix this, we need to emit a WAL record when truncating the clog. We
must also make sure that recovery of any WAL record type doesn't rely on
clog, because if we truncate the clog and then crash, recovery won't
have the clog available for the old transactions. At the moment,
TruncateCLog issues a checkpoint to protect from that but that's not
going to work when rolling forward logs in PITR, right?
4) If we fix issue 2 so that vactuple_get_minxid always takes xmax into
account, even if HEAP_XMAX_INVALID is set, a tuple with an aborted xmax
will keep us from advancing relminxid and truncating clog etc. That
doesn't lead to data corruption, but you will eventually hit the
transaction wrap-around limit. We don't have the same problem with xmin,
because we freeze tuples that are older than FreezeLimit to avoid it,
but we don't do that for xmax.
To fix this, replace any xmax older than FreezeLimit with InvalidXid
during vacuum. That also needs to be WAL logged.
5) We don't freeze tuples that are in RECENTLY_DEAD or
DELETE_IN_PROGRESS state. That doesn't lead to data corruption, but it
might make you hit the transaction wrap-around limit. That can happen if
you have a transaction that deletes or updates a very old, but not yet
frozen tuple. If you run vacuum while the deleting transaction is in
progress, vacuum won't freeze the tuple, and won't advance the
wrap-around limit because of the old tuple. That's not serious if the
deleting transaction commits, because the next vacuum will then remove
the tuple, but if it aborts, we might run into the same problem on the
next vacuum, and the next one, and the next one, until we reach the
To fix this, simply do the freezing for tuples in RECENTLY_DEAD and
DELETE_IN_PROGRESS states as well,
Am I missing something? Finding this many bugs makes me nervous...
Simon volunteered to make the clog changes for 3 because it's a PITR
related issue. I can write a patch/patches for the other changes if it
Tom Lane wrote:
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
Tom Lane wrote:
I think it's premature to start writing
patches until we've decided how this really needs to work.
Not logging hint-bit updates seems safe to me. As long as we have the
clog, the hint-bit is just a hint. The problem with freezing is that
after freezing tuples, the corresponding clog page can go away.
Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog. Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation. Maybe we should somehow not xlog updates of datvacuumxid?
Another thing I'm concerned about is the scenario where a PITR
hot-standby machine tracks a master over a period of more than 4 billion
transactions. I'm not sure what will happen in the slave's pg_clog
directory, but I'm afraid it won't be good :-(
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend