Thanks for looking into this and for creating the patch.

> +1 for that approach. heap_lock_updated_tuple() actually does that check
> too, but not for the first tuple.
I think this will certainly fix the problem; however, we are still
accessing completely unrelated tuples. It feels unsafe to access tuples
that might have been vacuumed/pruned away. Is it possible, for example,
that the page we are accessing has been truncated away in the meantime?

> I guess that works too, but the downside is that we can't vacuum aborted
> tuples as fast.

I don't know why aborted tuples would need to be vacuumed so much faster
than dead tuples caused by inserts/updates; I would think that you
generally have many more of the latter. In our code base we have chosen
this approach, since the fix also safeguards us against other potentially
problematic cases where the ctid pointer is followed.

>> * The initial tuple is assumed to be already locked.

> But AFAICS that's not true. The caller holds the heavy-weight tuple

This does not look true to me either.

I simplified your patch a little bit by extracting common code to the
heap_lock_updated_tuple() function. I also added the new test to the
Makefile.

Jasper Smit

On Thu, Dec 11, 2025 at 5:30 PM Heikki Linnakangas <[email protected]> wrote:
>
> On 13/10/2025 10:48, Jasper Smit wrote:
> > We found a bug in heap_lock_tuple(). It seems to be unsafe to follow
> > the t_ctid pointer of updated tuples in the presence of aborted
> > tuples. Vacuum can mark aborted tuples LP_UNUSED, even if there are
> > still visible tuples that have a t_ctid pointing to them.
> >
> > We would like to address this issue in postgres. Attached is a patch
> > that modifies VACUUM behavior. Could you advise us on the best way to
> > proceed? How can we get this patch included in postgres, or would you
> > recommend pursuing an alternative approach?
>
> You're at the right place :-). Thanks for the script and the patch!
>
> > The following scenario will lead to an assertion failure in a debug
> > build, or to wrong results in a release build.
> >
> > 1. (session 1) A tuple is being UPDATEd in an active transaction.
> > 2.
> > (session 2) Another session locks that tuple, for example, SELECT
> > .. FOR UPDATE.
> > 3. heap_lock_tuple() calls HeapTupleSatisfiesUpdate(), which returns
> > TM_BeingModified. It will record the xmax and t_ctid of the tuple.
> > 4. Since the tuple is actively being updated, we need to wait for
> > session 1 to complete.
> > 5. The UPDATE of session 1 aborts, leaving the original tuple with an
> > aborted xmax and t_ctid pointing to a tuple with an aborted xmin.
> > 6. VACUUM marks aborted tuples as unused, regardless of whether they
> > are still referenced by a visible tuple through t_ctid. As a result,
> > the aborted tuple is marked unused.
> > 7. An INSERT happens and the new tuple is placed at the recycled
> > location of the aborted tuple.
> > 8. The newly INSERTed tuple is UPDATEd.
> > 9. heap_lock_tuple() of session 2 resumes, and will next call
> > heap_lock_updated_tuple() using the t_ctid which was recorded in
> > step 3. Note that the code does not check the status of the
> > transaction it was waiting on; it proceeds regardless of whether
> > that transaction committed or aborted.
> > 10. heap_lock_updated_tuple() will now operate on the tuple inserted
> > in step 7. It will see that this tuple was updated, and thus the
> > return value of heap_lock_updated_tuple() will be TM_Updated.
> > 11.
> > This will eventually lead to the assertion (result ==
> > TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)
> >
> > These concrete steps reproduce the problem:
> >
> > Used version: REL_18_0
> > Run with configuration:
> > io_combine_limit = '1'
> >
> > session 1:
> > drop table if exists t;
> > create table t (id int);
> > insert into t (select generate_series(1, 452));
> > begin;
> > update t set id = 1000 where id = 1;
> >
> > session 2:
> > gdb: b heapam.c:5033
> > gdb: continue
> > select * from t where id = 1 for update;
> >
> > session 1:
> > abort;
> > select * from t;
> > VACUUM (TRUNCATE off);
> > insert into t values (453) returning ctid; -- Should be (2,1)
> > update t set id = 454 where id = 453 returning ctid;
> >
> > session 2: (should now be at the breakpoint)
> > gdb: continue
> >
> > We get the assertion:
> > (result == TM_WouldBlock) || !(tuple->t_data->t_infomask &
> > HEAP_XMAX_INVALID).
> > Stacktrace included as attachment.
>
> I can reproduce this.
>
> > We have also seen the error "t_xmin %u is uncommitted in tuple (%u,%u)
> > to be updated in table \"%s\"" in our testing. We believe that this
> > has a similar cause; however, we have not been able to come up with
> > reproduction steps for that assertion specifically.
>
> Yeah, sounds quite plausible that you could get that too.
>
> > We think that this problem can be fixed in two different ways.
> > 1. Check the commit status after waiting in heap_lock_updated_tuple(),
> > and don't proceed with checking the updated tuple when the
> > transaction has aborted,
>
> Hmm, I think heap_lock_updated_tuple() would still lock incorrect tuples
> in that case, which could lead to unrelated transactions failing, or at
> least waiting unnecessarily.
>
> > or check, like in heap_get_latest_tid(), if the xmin of the updated
> > tuple matches the xmax of the old tuple.
>
> +1 for that approach. heap_lock_updated_tuple() actually does that check
> too, but not for the first tuple.
>
> > 2.
> > Change vacuum to delay marking aborted tuples dead until no visible
> > tuple has a t_ctid pointing to that tuple.
>
> I guess that works too, but the downside is that we can't vacuum aborted
> tuples as fast.
>
> Attached is a fix that checks the prior xmax in
> heap_lock_updated_tuple() for the first tuple already. The first commit
> adds your repro script as a regression test using an injection point. It
> hits the assertion failure, or returns an incorrect result if assertions
> are disabled. The second commit fixes the bug and makes the test pass.
>
> The comment on heap_lock_updated_tuple says:
>
> > * The initial tuple is assumed to be already locked.
>
> But AFAICS that's not true. The caller holds the heavy-weight tuple
> lock, to establish priority if there are multiple concurrent lockers,
> but it doesn't stamp the initial tuple's xmax until after the
> heap_lock_updated_tuple() call. I guess we just need to update that
> comment, but I would love to get another pair of eyes on this; this code
> is hairy.
>
> - Heikki
v2-0002-Fix-the-bug-with-priorXmax.patch
v2-0001-Add-repro-for-heap-locking-visibility-bug-as-a-tap.patch
