On Thu, Oct 19, 2017 at 7:21 AM, Robert Haas <robertmh...@gmail.com> wrote: > The commit message for a5736bf7 doesn't say anything about a race; it > just claims that it is fixing traversal of half-frozen update chains, > without making any reference to how such a thing as a half-frozen > update chain came to exist in the first place. I think the commit > that talks about that race condition is 20b65522.
Well, the race was that the wrong tuple is pruned, because of naivety in the heap_prune_chain() update chain handling, which is what started work that became commit a5736bf7. That's by far the worst consequence of the traversal of half-frozen update chain business that a5736bf7's commit message describes, since we know it leads to wrong answers to index scans (and REINDEX fails with "failed to find parent tuple"). It's the only consequence that's been directly observed, but probably not the only one that's possible. Most other places that suffer the problem are in obscure-to-users paths like EvalPlanQual(). > /me studies the problem for a while. > > What's bothering me about this is: how is cutoff_xid managing to be a > new enough transaction ID for this to happen in the first place? The > cutoff XID should certainly be older than anything any current > snapshot can see, because it's computed using GetOldestXmin() -- which > means that XIDs older than the cutoff can't still be running (except > maybe if old_snapshot_threshold is set to a non-default value). The problem that we directly observed during pruning, which caused "failed to find parent tuple" even after the first freeze-the-dead commit (commit 20b65522), was that the mixture of frozen and unfrozen tuples in a hot chain caused heap_prune_chain() to make an incorrect conclusion about the continuity of a HOT chain. We'd prune the wrong tuples -- it failed to find the visible tuple at the end of the HOT chain, and respect that that meant it could not prune down to a stub ItemId (blow away what it *incorrectly* thought was an entire HOT chain). Now that the problem is fixed by a5736bf7, this test case will prune and get an LP_REDIRECT ItemId (not an LP_DEAD one), as we're no longer confused about the continuity of HOT chains within heap_prune_chain(). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers