On Sun, Oct 27, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote: > 27.10.2024 07:09, Noah Misch wrote: > > On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote: > > > intra-grant-inplace-db.spec got a novel failure today: > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58 > > > > > > The isolationtester_waiting query is supposed to detect that step vac2 is > > > blocked. vac2 didn't finish within the timeout, but > > > isolationtester_waiting > > > never considered it blocked. > > > ... work on reproducing this.
The attached demo reproduces $SUBJECT for me. The clue was 'CONTEXT: while scanning block 0 of relation "pg_catalog.pg_database"'. That vacuum_error_callback() message indicated VACUUM_ERRCB_PHASE_SCAN_HEAP as the current phase. That implies vac2 had not reached any inplace update, any LOCKTAG_TUPLE, or any part of vac_update_datfrozenxid(). I bet vac2 was stuck in LockBufferForCleanup(). Concurrently, a sequence of autovacuum workers held a buffer pin blocking that cleanup. The demo makes two changes: a. Start intra-grant-inplace-db.spec step vac2 just after some autovacuum worker is blocked on the xid of the uncommitted GRANT. While blocked, the worker holds a pin on the one pg_database page. vac2 needs LockBufferForCleanup($THAT_PAGE). b. Make LockBufferForCleanup() sleep 10ms after WAIT_EVENT_BUFFER_PIN, to give the next autovacuum worker time to win the pinning race. LockBufferForCleanup() waits for a pin count to fall to 1. While waiting, its techniques are less sophisticated than what we have for lock.c heavyweight locks. UnpinBufferNoOwner() notifies the cleanup waiter, but nothing stops another backend from pinning before the cleanup waiter reacts. We have code to cancel an autovacuum for the purpose of liberating a heavyweight lock, but we don't have code like that to free a pin. pg_isolation_test_session_is_blocked() doesn't detect buffer pin blockages. Two fix candidates look most promising: 1. Unpin before heap_inplace_lock() sleeps. 2. Change intra-grant-inplace-db.spec to test freezing only MXIDs, not XIDs. Let's do (1), as attached. Apart from some arguably-convoluted call stacks, this has no disadvantages. An earlier version did unpin. postgr.es/m/20240822073200.4f.nmi...@google.com stopped that, arguing, "heap_update() doesn't optimize that way". In light of $SUBJECT, I no longer see heap_update() as the applicable standard. Since autovacuum can run anytime, it has an extra duty to be non-disruptive. heap_update() during auto-analyze won't wait much, because analyze takes a self-exclusive table lock and is the sole writer of stats tables. Inplace updates during autovacuum are different. They do contend with other transactions. Since we lack code to cancel an autovacuum to free a pin, a long-lived pin in autovacuum is more disruptive than a long-lived lock in autovacuum. While I've not tried it, I expect (2) would work as follows. pg_database won't contain MXIDs, so lazy_scan_noprune() would approve continuing without the cleanup lock. Unlike (1), this wouldn't help concurrency outside the test. > FWIW, there was a similar failure in August: [1], and I also could not > reproduce that locally, yet wrote a preliminary analysis at [2] in the > Unsorted section, in the hope to see it again and continue investigation. > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57 > [2] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures Thanks. I had not known about iguana's failure. What are the chances that the buffer pin could explain that one?
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> [not for commit] demo pin starvation from autovacuum inplace update Test with intra-grant-inplace-db.spec. Changes here: a. Start intra-grant-inplace-db.spec step vac2 just after some autovacuum worker is blocked on the xid of the uncommitted GRANT. While blocked, the worker holds a pin on the one pg_database page. vac2 needs LockBufferForCleanup(that-page). b. Make LockBufferForCleanup() sleep 10ms after WAIT_EVENT_BUFFER_PIN, to give the next autovacuum worker time to win the pinning race. diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0f02bf6..5268294 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -5344,8 +5344,17 @@ LockBufferForCleanup(Buffer buffer) SetStartupBufferPinWaitBufId(-1); } else + { ProcWaitForSignal(WAIT_EVENT_BUFFER_PIN); + /* + * Sleep to induce starvation. With heavyweight locks, the + * releaser grants to the next queue member, preventing + * starvation. Buffer pins have no such mechanism. + */ + pg_usleep(10000); + } + /* * Remove flag marking us as waiter. Normally this will not be set * anymore, but ProcWaitForSignal() can return for other signals as diff --git a/src/test/isolation/Makefile b/src/test/isolation/Makefile index ade2256..ca487de 100644 --- a/src/test/isolation/Makefile +++ b/src/test/isolation/Makefile @@ -64,6 +64,9 @@ installcheck: all check: all $(pg_isolation_regress_check) --schedule=$(srcdir)/isolation_schedule +nmcheck: all temp-install + $(pg_isolation_regress_check) intra-grant-inplace-db + # Non-default tests. It only makes sense to run these if set up to use # prepared transactions, via TEMP_CONFIG for the check case, or via the # postgresql.conf for the installcheck case. diff --git a/src/test/isolation/expected/intra-grant-inplace-db.out b/src/test/isolation/expected/intra-grant-inplace-db.out index a91402c..c019952 100644 --- a/src/test/isolation/expected/intra-grant-inplace-db.out +++ b/src/test/isolation/expected/intra-grant-inplace-db.out @@ -1,6 +1,6 @@ Parsed test spec with 3 sessions -starting permutation: snap3 b1 grant1 vac2 snap3 c1 cmp3 +starting permutation: snap3 b1 grant1 wait vac2 snap3 c1 cmp3 step snap3: INSERT INTO frozen_witness SELECT datfrozenxid FROM pg_database WHERE datname = current_catalog; @@ -9,6 +9,14 @@ step b1: BEGIN; step grant1: GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee; +step wait: + DO $$ + BEGIN + WHILE NOT EXISTS(SELECT 1 FROM pg_locks WHERE NOT granted AND transactionid IS NOT NULL) LOOP + NULL; + END LOOP; + END$$; + step vac2: VACUUM (FREEZE); <waiting ...> step snap3: INSERT INTO frozen_witness diff --git a/src/test/isolation/specs/intra-grant-inplace-db.spec b/src/test/isolation/specs/intra-grant-inplace-db.spec index 9de40ec..7a9cae4 100644 --- a/src/test/isolation/specs/intra-grant-inplace-db.spec +++ b/src/test/isolation/specs/intra-grant-inplace-db.spec @@ -40,6 +40,19 @@ step cmp3 { WHERE datname = current_catalog AND age(datfrozenxid) > (SELECT min(age(x)) FROM frozen_witness); } +# Busy-wait until some autovacuum workers's vac_update_datfrozenxid() blocks +# in heap_inplace_wait(), waiting for the xid of s1 to end. A vac2 at that +# point will block trying to acquire a cleanup lock on the page autovacuum has +# pinned. That's because systable_inplace_update_begin() acquires a pin that +# we hold until systable_inplace_update_finish(). +step wait { + DO $$ + BEGIN + WHILE NOT EXISTS(SELECT 1 FROM pg_locks WHERE NOT granted AND transactionid IS NOT NULL) LOOP + NULL; + END LOOP; + END$$; +} -permutation snap3 b1 grant1 vac2(c1) snap3 c1 cmp3 +permutation snap3 b1 grant1 wait vac2(c1) snap3 c1 cmp3
Author: Noah Misch <n...@leadboat.com> Commit: Noah Misch <n...@leadboat.com> Unpin buffer before inplace update waits for an XID to end. Commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257 changed inplace updates to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. By keeping the pin during that wait, a sequence of autovacuum workers and an uncommitted GRANT starved one foreground LockBufferForCleanup() for six minutes, on buildfarm member sarus. Prevent, at the cost of a bit of complexity. Back-patch to v12, like the earlier commit. That commit and heap_inplace_lock() have not yet appeared in any release. Reviewed by FIXME. Discussion: https://postgr.es/m/20241026184936.ae.nmi...@google.com diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 4c8febd..f4f1cb4 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -6165,8 +6165,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid) * transaction. If compatible, return true with the buffer exclusive-locked, * and the caller must release that by calling * heap_inplace_update_and_unlock(), calling heap_inplace_unlock(), or raising - * an error. Otherwise, return false after blocking transactions, if any, - * have ended. + * an error. Otherwise, call release_callback(arg), wait for blocking + * transactions to end, and return false. * * Since this is intended for system catalogs and SERIALIZABLE doesn't cover * DDL, this doesn't guarantee any particular predicate locking. @@ -6200,7 +6200,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid) */ bool heap_inplace_lock(Relation relation, - HeapTuple oldtup_ptr, Buffer buffer) + HeapTuple oldtup_ptr, Buffer buffer, + void (*release_callback) (void *), void *arg) { HeapTupleData oldtup = *oldtup_ptr; /* minimize diff vs. heap_update() */ TM_Result result; @@ -6265,6 +6266,7 @@ heap_inplace_lock(Relation relation, lockmode, NULL)) { LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + release_callback(arg); ret = false; MultiXactIdWait((MultiXactId) xwait, mxact_status, infomask, relation, &oldtup.t_self, XLTW_Update, @@ -6280,6 +6282,7 @@ heap_inplace_lock(Relation relation, else { LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + release_callback(arg); ret = false; XactLockTableWait(xwait, relation, &oldtup.t_self, XLTW_Update); @@ -6291,6 +6294,7 @@ heap_inplace_lock(Relation relation, if (!ret) { LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + release_callback(arg); } } diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index 69c3608..60c6103 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -814,6 +814,7 @@ systable_inplace_update_begin(Relation relation, int retries = 0; SysScanDesc scan; HeapTuple oldtup; + BufferHeapTupleTableSlot *bslot; /* * For now, we don't allow parallel updates. Unlike a regular update, @@ -835,10 +836,9 @@ systable_inplace_update_begin(Relation relation, Assert(IsInplaceUpdateRelation(relation) || !IsSystemRelation(relation)); /* Loop for an exclusive-locked buffer of a non-updated tuple. */ - for (;;) + do { TupleTableSlot *slot; - BufferHeapTupleTableSlot *bslot; CHECK_FOR_INTERRUPTS(); @@ -864,11 +864,9 @@ systable_inplace_update_begin(Relation relation, slot = scan->slot; Assert(TTS_IS_BUFFERTUPLE(slot)); bslot = (BufferHeapTupleTableSlot *) slot; - if (heap_inplace_lock(scan->heap_rel, - bslot->base.tuple, bslot->buffer)) - break; - systable_endscan(scan); - }; + } while (!heap_inplace_lock(scan->heap_rel, + bslot->base.tuple, bslot->buffer, + (void (*) (void *)) systable_endscan, scan)); *oldtupcopy = heap_copytuple(oldtup); *state = scan; diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index b951466..96cf82f 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -338,7 +338,8 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple, Buffer *buffer, struct TM_FailureData *tmfd); extern bool heap_inplace_lock(Relation relation, - HeapTuple oldtup_ptr, Buffer buffer); + HeapTuple oldtup_ptr, Buffer buffer, + void (*release_callback) (void *), void *arg); extern void heap_inplace_update_and_unlock(Relation relation, HeapTuple oldtup, HeapTuple tuple, Buffer buffer);