On Sun, Oct 27, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
> 27.10.2024 07:09, Noah Misch wrote:
> > On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote:
> > > intra-grant-inplace-db.spec got a novel failure today:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58
> > > 
> > > The isolationtester_waiting query is supposed to detect that step vac2 is
> > > blocked.  vac2 didn't finish within the timeout, but 
> > > isolationtester_waiting
> > > never considered it blocked.
> > > ... work on reproducing this.

The attached demo reproduces $SUBJECT for me.  The clue was 'CONTEXT:  while
scanning block 0 of relation "pg_catalog.pg_database"'.  That
vacuum_error_callback() message indicated VACUUM_ERRCB_PHASE_SCAN_HEAP as the
current phase.  That implies vac2 had not reached any inplace update, any
LOCKTAG_TUPLE, or any part of vac_update_datfrozenxid().  I bet vac2 was stuck
in LockBufferForCleanup().  Concurrently, a sequence of autovacuum workers
held a buffer pin blocking that cleanup.  The demo makes two changes:

a. Start intra-grant-inplace-db.spec step vac2 just after some autovacuum
   worker is blocked on the xid of the uncommitted GRANT.  While blocked, the
   worker holds a pin on the one pg_database page.  vac2 needs
   LockBufferForCleanup($THAT_PAGE).
b. Make LockBufferForCleanup() sleep 10ms after WAIT_EVENT_BUFFER_PIN, to give
   the next autovacuum worker time to win the pinning race.

LockBufferForCleanup() waits for a pin count to fall to 1.  While waiting, its
techniques are less sophisticated than what we have for lock.c heavyweight
locks.  UnpinBufferNoOwner() notifies the cleanup waiter, but nothing stops
another backend from pinning before the cleanup waiter reacts.  We have code
to cancel an autovacuum for the purpose of liberating a heavyweight lock, but
we don't have code like that to free a pin.
pg_isolation_test_session_is_blocked() doesn't detect buffer pin blockages.


Two fix candidates look most promising:

1. Unpin before heap_inplace_lock() sleeps.
2. Change intra-grant-inplace-db.spec to test freezing only MXIDs, not XIDs.

Let's do (1), as attached.  Apart from some arguably-convoluted call stacks,
this has no disadvantages.  An earlier version did unpin.
postgr.es/m/20240822073200.4f.nmi...@google.com stopped that, arguing,
"heap_update() doesn't optimize that way".  In light of $SUBJECT, I no longer
see heap_update() as the applicable standard.  Since autovacuum can run
anytime, it has an extra duty to be non-disruptive.  heap_update() during
auto-analyze won't wait much, because analyze takes a self-exclusive table
lock and is the sole writer of stats tables.  Inplace updates during
autovacuum are different.  They do contend with other transactions.  Since we
lack code to cancel an autovacuum to free a pin, a long-lived pin in
autovacuum is more disruptive than a long-lived lock in autovacuum.

While I've not tried it, I expect (2) would work as follows.  pg_database
won't contain MXIDs, so lazy_scan_noprune() would approve continuing without
the cleanup lock.  Unlike (1), this wouldn't help concurrency outside the
test.

> FWIW, there was a similar failure in August: [1], and I also could not
> reproduce that locally, yet wrote a preliminary analysis at [2] in the
> Unsorted section, in the hope to see it again and continue investigation.
> 
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57
> [2] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures

Thanks.  I had not known about iguana's failure.  What are the chances that
the buffer pin could explain that one?
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    [not for commit] demo pin starvation from autovacuum inplace update
    
    Test with intra-grant-inplace-db.spec.  Changes here:
    
    a. Start intra-grant-inplace-db.spec step vac2 just after some
       autovacuum worker is blocked on the xid of the uncommitted GRANT.
       While blocked, the worker holds a pin on the one pg_database page.
       vac2 needs LockBufferForCleanup(that-page).
    b. Make LockBufferForCleanup() sleep 10ms after WAIT_EVENT_BUFFER_PIN,
       to give the next autovacuum worker time to win the pinning race.

diff --git a/src/backend/storage/buffer/bufmgr.c 
b/src/backend/storage/buffer/bufmgr.c
index 0f02bf6..5268294 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5344,8 +5344,17 @@ LockBufferForCleanup(Buffer buffer)
                        SetStartupBufferPinWaitBufId(-1);
                }
                else
+               {
                        ProcWaitForSignal(WAIT_EVENT_BUFFER_PIN);
 
+                       /*
+                        * Sleep to induce starvation.  With heavyweight locks, 
the
+                        * releaser grants to the next queue member, preventing
+                        * starvation.  Buffer pins have no such mechanism.
+                        */
+                       pg_usleep(10000);
+               }
+
                /*
                 * Remove flag marking us as waiter. Normally this will not be 
set
                 * anymore, but ProcWaitForSignal() can return for other 
signals as
diff --git a/src/test/isolation/Makefile b/src/test/isolation/Makefile
index ade2256..ca487de 100644
--- a/src/test/isolation/Makefile
+++ b/src/test/isolation/Makefile
@@ -64,6 +64,9 @@ installcheck: all
 check: all
        $(pg_isolation_regress_check) --schedule=$(srcdir)/isolation_schedule
 
+nmcheck: all temp-install
+       $(pg_isolation_regress_check) intra-grant-inplace-db
+
 # Non-default tests.  It only makes sense to run these if set up to use
 # prepared transactions, via TEMP_CONFIG for the check case, or via the
 # postgresql.conf for the installcheck case.
diff --git a/src/test/isolation/expected/intra-grant-inplace-db.out 
b/src/test/isolation/expected/intra-grant-inplace-db.out
index a91402c..c019952 100644
--- a/src/test/isolation/expected/intra-grant-inplace-db.out
+++ b/src/test/isolation/expected/intra-grant-inplace-db.out
@@ -1,6 +1,6 @@
 Parsed test spec with 3 sessions
 
-starting permutation: snap3 b1 grant1 vac2 snap3 c1 cmp3
+starting permutation: snap3 b1 grant1 wait vac2 snap3 c1 cmp3
 step snap3: 
        INSERT INTO frozen_witness
        SELECT datfrozenxid FROM pg_database WHERE datname = current_catalog;
@@ -9,6 +9,14 @@ step b1: BEGIN;
 step grant1: 
        GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee;
 
+step wait: 
+       DO $$
+       BEGIN
+               WHILE NOT EXISTS(SELECT 1 FROM pg_locks WHERE NOT granted AND 
transactionid IS NOT NULL) LOOP
+                       NULL;
+               END LOOP;
+       END$$;
+
 step vac2: VACUUM (FREEZE); <waiting ...>
 step snap3: 
        INSERT INTO frozen_witness
diff --git a/src/test/isolation/specs/intra-grant-inplace-db.spec 
b/src/test/isolation/specs/intra-grant-inplace-db.spec
index 9de40ec..7a9cae4 100644
--- a/src/test/isolation/specs/intra-grant-inplace-db.spec
+++ b/src/test/isolation/specs/intra-grant-inplace-db.spec
@@ -40,6 +40,19 @@ step cmp3    {
        WHERE datname = current_catalog
                AND age(datfrozenxid) > (SELECT min(age(x)) FROM 
frozen_witness);
 }
+# Busy-wait until some autovacuum workers's vac_update_datfrozenxid() blocks
+# in heap_inplace_wait(), waiting for the xid of s1 to end.  A vac2 at that
+# point will block trying to acquire a cleanup lock on the page autovacuum has
+# pinned.  That's because systable_inplace_update_begin() acquires a pin that
+# we hold until systable_inplace_update_finish().
+step wait      {
+       DO $$
+       BEGIN
+               WHILE NOT EXISTS(SELECT 1 FROM pg_locks WHERE NOT granted AND 
transactionid IS NOT NULL) LOOP
+                       NULL;
+               END LOOP;
+       END$$;
+}
 
 
-permutation snap3 b1 grant1 vac2(c1) snap3 c1 cmp3
+permutation snap3 b1 grant1 wait vac2(c1) snap3 c1 cmp3
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    Unpin buffer before inplace update waits for an XID to end.
    
    Commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257 changed inplace updates
    to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE.
    By keeping the pin during that wait, a sequence of autovacuum workers
    and an uncommitted GRANT starved one foreground LockBufferForCleanup()
    for six minutes, on buildfarm member sarus.  Prevent, at the cost of a
    bit of complexity.  Back-patch to v12, like the earlier commit.  That
    commit and heap_inplace_lock() have not yet appeared in any release.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/20241026184936.ae.nmi...@google.com

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4c8febd..f4f1cb4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6165,8 +6165,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
  * transaction.  If compatible, return true with the buffer exclusive-locked,
  * and the caller must release that by calling
  * heap_inplace_update_and_unlock(), calling heap_inplace_unlock(), or raising
- * an error.  Otherwise, return false after blocking transactions, if any,
- * have ended.
+ * an error.  Otherwise, call release_callback(arg), wait for blocking
+ * transactions to end, and return false.
  *
  * Since this is intended for system catalogs and SERIALIZABLE doesn't cover
  * DDL, this doesn't guarantee any particular predicate locking.
@@ -6200,7 +6200,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
  */
 bool
 heap_inplace_lock(Relation relation,
-                                 HeapTuple oldtup_ptr, Buffer buffer)
+                                 HeapTuple oldtup_ptr, Buffer buffer,
+                                 void (*release_callback) (void *), void *arg)
 {
        HeapTupleData oldtup = *oldtup_ptr; /* minimize diff vs. heap_update() 
*/
        TM_Result       result;
@@ -6265,6 +6266,7 @@ heap_inplace_lock(Relation relation,
                                                                                
lockmode, NULL))
                        {
                                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+                               release_callback(arg);
                                ret = false;
                                MultiXactIdWait((MultiXactId) xwait, 
mxact_status, infomask,
                                                                relation, 
&oldtup.t_self, XLTW_Update,
@@ -6280,6 +6282,7 @@ heap_inplace_lock(Relation relation,
                else
                {
                        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+                       release_callback(arg);
                        ret = false;
                        XactLockTableWait(xwait, relation, &oldtup.t_self,
                                                          XLTW_Update);
@@ -6291,6 +6294,7 @@ heap_inplace_lock(Relation relation,
                if (!ret)
                {
                        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+                       release_callback(arg);
                }
        }
 
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 69c3608..60c6103 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -814,6 +814,7 @@ systable_inplace_update_begin(Relation relation,
        int                     retries = 0;
        SysScanDesc scan;
        HeapTuple       oldtup;
+       BufferHeapTupleTableSlot *bslot;
 
        /*
         * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -835,10 +836,9 @@ systable_inplace_update_begin(Relation relation,
        Assert(IsInplaceUpdateRelation(relation) || 
!IsSystemRelation(relation));
 
        /* Loop for an exclusive-locked buffer of a non-updated tuple. */
-       for (;;)
+       do
        {
                TupleTableSlot *slot;
-               BufferHeapTupleTableSlot *bslot;
 
                CHECK_FOR_INTERRUPTS();
 
@@ -864,11 +864,9 @@ systable_inplace_update_begin(Relation relation,
                slot = scan->slot;
                Assert(TTS_IS_BUFFERTUPLE(slot));
                bslot = (BufferHeapTupleTableSlot *) slot;
-               if (heap_inplace_lock(scan->heap_rel,
-                                                         bslot->base.tuple, 
bslot->buffer))
-                       break;
-               systable_endscan(scan);
-       };
+       } while (!heap_inplace_lock(scan->heap_rel,
+                                                               
bslot->base.tuple, bslot->buffer,
+                                                               (void (*) (void 
*)) systable_endscan, scan));
 
        *oldtupcopy = heap_copytuple(oldtup);
        *state = scan;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b951466..96cf82f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -338,7 +338,8 @@ extern TM_Result heap_lock_tuple(Relation relation, 
HeapTuple tuple,
                                                                 Buffer 
*buffer, struct TM_FailureData *tmfd);
 
 extern bool heap_inplace_lock(Relation relation,
-                                                         HeapTuple oldtup_ptr, 
Buffer buffer);
+                                                         HeapTuple oldtup_ptr, 
Buffer buffer,
+                                                         void 
(*release_callback) (void *), void *arg);
 extern void heap_inplace_update_and_unlock(Relation relation,
                                                                                
   HeapTuple oldtup, HeapTuple tuple,
                                                                                
   Buffer buffer);

Reply via email to