On Thu, May 2, 2019 at 10:28 AM Andres Freund <and...@anarazel.de> wrote: > It'd be good if somebody could make a pass over the safety mechanisms in > heap_prepare_freeze_tuple(). I added them at some point, after a data > corrupting bug related to freezing, but they really were more intended > as a secondary layer of defense, rather than the primary one. My > understanding is still that we assume that we never should reach > heap_prepare_freeze_tuple() for something that is below the horizon, > even after this change, right?
I think so. This code is hit if the tuple wasn't dead yet at the time that heap_page_prune() decided not to prune it, but it is dead by the time we retest the tuple status. But the same value of OldestXmin was used for both checks, so the only way that can really happen is if the XID in the tuple header was running before and is no longer running. However, if the XID was running at the time that heap_page_prune() ran, than OldestXmin certainly can't be newer than that XID. And therefore the value we're intending to set for relfrozenxid has surely got to be older, so we could hardly prune using OldestXmin as the threshold and then relfrozenxid to a newer XID. Actually, I now believe that my original concern here was exactly backwards. Prior to the logic change in this commit, with index vacuum disabled, a tuple that becomes dead between the heap_page_prune() check and the lazy_scan_heap() check would have followed the tupgone = true path. That would cause it to be treated as if the second vacuum pass were going to remove it - i.e. not frozen. But with index cleanup disabled, there will never be a second vacuum pass. So a tuple that was actually being kept was not sent to heap_prepare_freeze_tuple() with, basically, no justification. Imagine for example that the tuple was not old in terms of its XID age, but its MXID age was somehow ancient. Then we'd fail to freeze it on the theory that it was going to be removed, but yet not remove it. Oops. The revised logic - which is as per Tom's suggestion - does the right thing, which is treat the tuple as one we've chosen to keep. While looking at this code, I think I may have spotted another bug, or at least a near-bug. lazy_record_dead_tuple() thinks it's OK to just forget about dead tuples if there's not enough memory, which I think is OK in the normal case where the dead tuple has been truncated to a dead line pointer. But in the tupgone = true case it's NOT ok, because in that case we're leaving behind not just a dead line pointer but an actual tuple which we have declined to freeze on the assumption that it will be removed later. But if lazy_record_dead_tuple() forgets about it, then it won't be removed later. That could lead to tuples remaining that precede the freeze horizons. The reason why this may be only a near-bug is that we seem to try pretty hard to make sure that we'll never call that function in the first place without enough space being available. Still, it seems to me that it would be safer to change the code to look like: if (vacrelstats->num_dead_tuples >= vacrelstats->max_dead_tuples) elog(ERROR, "oh crap"); -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company