Hi,

On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
> > The hardest part about this change is that everything kind of depends on
> > each other. The changes are large enough that they clearly can't just be
> > committed at once, but doing them over time risks [temporary] performance
> > regressions.
> >
> > The order of changes I think makes the most sense is the following:
> >
> > 1) Allow some modifications while holding the buffer header spinlock
> >
> > 2) Reduce buffer pin with just an atomic-sub
> >
> > This needs to happen first, otherwise there are performance regressions
> > during the later steps.
>
> Here are the first few cleaned up patches implementing the above steps, as
> well as some cleanups. I included a commit from another thread, as it
> conflicts with these changes, and we really should apply it - and it's
> arguably required to make the changes viable, as it removes one more use of
> PinBuffer_Locked().
>
> Another change included is to not return the buffer with the spinlock held
> from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The
> reason for that is to reduce the most common PinBuffer_Locked() call. By
> definition PinBuffer_Locked() will become a bit slower due to 0003. But even
> without 0003, 0002 is faster than master. And the previous approach also
> just seems pretty unclean. I don't love that it requires the new
> TrackNewBufferPin(), but I don't really have a better idea.
>
> I invite particular attention to the commit message for 0003 as well as the
> comment changes in buf_internals.h within.
Robert looked at the patches while we were chatting, and I addressed his
feedback in this new version. Changes:

- Updated patch description for 0002, giving a lot more background

- Improved BufferDesc comments a fair bit more in 0003

- Reduced the size of 0003 a bit, by using UnlockBufHdrExt() instead of a CAS
  loop in buffer_stage_common() and reordering some things in
  TerminateBufferIO()

- Made 0004 only use the non-looping atomic op in UnpinBufferNoOwner(), not
  MarkBufferDirty(). I realized the latter would take additional complexity
  to make safe (a CAS loop in TerminateBufferIO()). I am somewhat doubtful
  that there are workloads where it matters...

Greetings,

Andres Freund
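PS: For anyone who wants the gist of 0003/0004 without wading through the
diffs, here is a minimal standalone model of the BufferDesc.state word,
written with C11 atomics rather than pg_atomic_*. The constant names follow
buf_internals.h, but the layout is simplified (no usagecount) and none of
this is the patched code itself - it only illustrates why releasing the
buffer header lock with an atomic read-modify-write, instead of a plain write
of the whole word, makes a plain atomic-sub unpin safe.

/*
 * Standalone illustration only - C11 atomics, not pg_atomic_*, and no
 * usagecount handling. Mirrors the refcount / BM_LOCKED split described in
 * buf_internals.h.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define BUF_REFCOUNT_ONE    1u
#define BUF_REFCOUNT_MASK   ((1u << 18) - 1)
#define BM_LOCKED           (1u << 22)  /* buffer header lock bit */
#define BM_DIRTY            (1u << 23)

static _Atomic uint32_t state;

/*
 * Post-0003 unlock: drop BM_LOCKED with an atomic RMW that can also set/unset
 * flag bits and adjust the refcount (the idea behind UnlockBufHdrExt()).
 */
static uint32_t
unlock_hdr_ext(uint32_t old, uint32_t set_bits, uint32_t unset_bits,
               int refcount_change)
{
    for (;;)
    {
        uint32_t newval = old;

        newval |= set_bits;
        newval &= ~unset_bits;
        newval &= ~BM_LOCKED;
        newval += BUF_REFCOUNT_ONE * refcount_change;

        if (atomic_compare_exchange_weak(&state, &old, newval))
            return old;
    }
}

/*
 * Pre-0004 unpin: the lock holder used to release with a plain write of the
 * whole word, so a lock-free decrement could be lost; the unpin therefore had
 * to wait for BM_LOCKED to clear and use a CAS.
 */
static void
unpin_cas_loop(void)
{
    uint32_t old = atomic_load(&state);

    for (;;)
    {
        while (old & BM_LOCKED)
            old = atomic_load(&state);

        if (atomic_compare_exchange_weak(&state, &old, old - BUF_REFCOUNT_ONE))
            return;
    }
}

/* Post-0004 unpin: a single atomic sub, legal even while BM_LOCKED is set. */
static void
unpin_atomic_sub(void)
{
    atomic_fetch_sub(&state, BUF_REFCOUNT_ONE);
}

int
main(void)
{
    /* two pins, header lock currently held */
    atomic_store(&state, 2 * BUF_REFCOUNT_ONE + BM_LOCKED);

    unpin_atomic_sub();                                   /* concurrent unpin */
    unlock_hdr_ext(atomic_load(&state), BM_DIRTY, 0, 0);  /* lock holder */
    unpin_cas_loop();                                     /* old-style unpin */

    printf("refcount=%u dirty=%d\n",
           (unsigned) (atomic_load(&state) & BUF_REFCOUNT_MASK),
           (atomic_load(&state) & BM_DIRTY) != 0);
    return 0;
}

Under the old rules the plain-write unlock could silently overwrite a
concurrent fetch-sub of the refcount, which is why unpin_cas_loop() has to
wait for BM_LOCKED to clear before attempting its CAS; once the unlock itself
is an atomic RMW, unpin_atomic_sub() cannot lose an update even while the
lock is held.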
>From 5829480f2ee93a8ee438756ebc79d54e82e3dc88 Mon Sep 17 00:00:00 2001 From: Thomas Munro <[email protected]> Date: Thu, 29 Jun 2023 10:52:56 +1200 Subject: [PATCH v4 1/6] Improve ReadRecentBuffer() scalability. While testing a new potential use for ReadRecentBuffer(), Andres reported that it scales badly when called concurrently for the same buffer by many backends. Instead of a naive (but wrong) coding with PinBuffer(), it used the spinlock, so that it could be careful to pin only if the buffer was valid and holding the expected block, to avoid breaking invariants in eg GetVictimBuffer(). Unfortunately that made it less scalable than PinBuffer(), which uses compare-exchange instead. We can fix that by giving PinBuffer() a new skip_if_not_valid mode that doesn't pin invalid buffers. It might occasionally skip when it shouldn't due to the unlocked read of the header flags, but that's unlikely and perfectly acceptable for an opportunistic optimisation routine, and it can only succeed when it really should due to the compare-exchange loop. XXX This also fixes ReadRecentBuffer()'s failure to bump the usage count. Fix separately or back-patch this? Author: Thomas Munro <[email protected]> Reported-by: Andres Freund <[email protected]> Reviewed-by: Andres Freund <[email protected]> Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de --- src/backend/storage/buffer/bufmgr.c | 60 +++++++++++++---------------- 1 file changed, 26 insertions(+), 34 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index fe470de63f2..b5eebfb6990 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -512,7 +512,8 @@ static BlockNumber ExtendBufferedRelShared(BufferManagerRelation bmr, BlockNumber extend_upto, Buffer *buffers, uint32 *extended_by); -static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy); +static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, + bool skip_if_not_valid); static void PinBuffer_Locked(BufferDesc *buf); static void UnpinBuffer(BufferDesc *buf); static void UnpinBufferNoOwner(BufferDesc *buf); @@ -685,7 +686,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN BufferDesc *bufHdr; BufferTag tag; uint32 buf_state; - bool have_private_ref; Assert(BufferIsValid(recent_buffer)); @@ -713,38 +713,24 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN else { bufHdr = GetBufferDescriptor(recent_buffer - 1); - have_private_ref = GetPrivateRefCount(recent_buffer) > 0; /* - * Do we already have this buffer pinned with a private reference? If - * so, it must be valid and it is safe to check the tag without - * locking. If not, we have to lock the header first and then check. + * Is it still valid and holding the right tag? We do an unlocked tag + * comparison first, to make it unlikely that we'll increment the + * usage counter of the wrong buffer, if someone calls us with a very + * out of date recent_buffer. Then we'll check it again if we get the + * pin. */ - if (have_private_ref) - buf_state = pg_atomic_read_u32(&bufHdr->state); - else - buf_state = LockBufHdr(bufHdr); - - if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag)) + if (BufferTagsEqual(&tag, &bufHdr->tag) && + PinBuffer(bufHdr, NULL, true)) { - /* - * It's now safe to pin the buffer. 
We can't pin first and ask - * questions later, because it might confuse code paths like - * InvalidateBuffer() if we pinned a random non-matching buffer. - */ - if (have_private_ref) - PinBuffer(bufHdr, NULL); /* bump pin count */ - else - PinBuffer_Locked(bufHdr); /* pin for first time */ - - pgBufferUsage.shared_blks_hit++; - - return true; + if (BufferTagsEqual(&tag, &bufHdr->tag)) + { + pgBufferUsage.shared_blks_hit++; + return true; + } + UnpinBuffer(bufHdr); } - - /* If we locked the header above, now unlock. */ - if (!have_private_ref) - UnlockBufHdr(bufHdr, buf_state); } return false; @@ -2036,7 +2022,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, */ buf = GetBufferDescriptor(existing_buf_id); - valid = PinBuffer(buf, strategy); + valid = PinBuffer(buf, strategy, false); /* Can release the mapping lock as soon as we've pinned it */ LWLockRelease(newPartitionLock); @@ -2098,7 +2084,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, existing_buf_hdr = GetBufferDescriptor(existing_buf_id); - valid = PinBuffer(existing_buf_hdr, strategy); + valid = PinBuffer(existing_buf_hdr, strategy, false); /* Can release the mapping lock as soon as we've pinned it */ LWLockRelease(newPartitionLock); @@ -2736,7 +2722,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, * Pin the existing buffer before releasing the partition lock, * preventing it from being evicted. */ - valid = PinBuffer(existing_hdr, strategy); + valid = PinBuffer(existing_hdr, strategy, false); LWLockRelease(partition_lock); UnpinBuffer(victim_buf_hdr); @@ -3035,10 +3021,13 @@ ReleaseAndReadBuffer(Buffer buffer, * must have been done already. * * Returns true if buffer is BM_VALID, else false. This provision allows - * some callers to avoid an extra spinlock cycle. + * some callers to avoid an extra spinlock cycle. If skip_if_not_valid is + * true, then a false return value also indicates that the buffer was + * (recently) invalid and has not been pinned. */ static bool -PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy) +PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, + bool skip_if_not_valid) { Buffer b = BufferDescriptorGetBuffer(buf); bool result; @@ -3062,6 +3051,9 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy) if (old_buf_state & BM_LOCKED) old_buf_state = WaitBufHdrUnlocked(buf); + if (unlikely(skip_if_not_valid && !(old_buf_state & BM_VALID))) + return false; + buf_state = old_buf_state; /* increase refcount */ -- 2.48.1.76.g4e746b1a31.dirty
>From f2f8ce7f495c6ac0e84847e532fcba2d6914b37d Mon Sep 17 00:00:00 2001 From: Andres Freund <[email protected]> Date: Mon, 8 Sep 2025 17:16:01 -0400 Subject: [PATCH v4 2/6] bufmgr: Don't lock buffer header in StrategyGetBuffer() Previously StrategyGetBuffer() acquired the buffer header spinlock for every buffer, whether it was reusable or not. If reusable, it'd be returned, with the lock held, to GetVictimBuffer(), which then would pin the buffer with PinBuffer_Locked(). That's somewhat violating the spirit of the guidelines for holding spinlocks (i.e. that they are only held for a few lines of consecutive code) and necessitates using PinBuffer_Locked(), which scales worse than PinBuffer() due to holding the spinlock. This alone makes it worth changing the code. However, the main reason to change this is that a future commit will make PinBuffer_Locked() slower (due to making UnlockBufHdr() slower), to gain scalability for the much more common case of pinning a pre-existing buffer. By pinning the buffer with a single atomic operation, iff the buffer is reusable, we avoid any potential regression for miss-heavy workloads. There strictly are fewer atomic operations for each potential buffer after this change. The price for this improvement is that freelist.c needs two CAS loops and needs to be able to set up the resource accounting for pinned buffers. The latter is achieved by exposing a new function for that purpose from bufmgr.c, that seems better than exposing the entire private refcount infrastructure. The improvement seems worth the complexity. Reviewed-by: Robert Haas <[email protected]> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff --- src/include/storage/buf_internals.h | 4 + src/backend/storage/buffer/bufmgr.c | 44 +++++------ src/backend/storage/buffer/freelist.c | 110 +++++++++++++++++++------- 3 files changed, 105 insertions(+), 53 deletions(-) diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index dfd614f7ca4..c1206a46aba 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -371,6 +371,8 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state) pg_atomic_write_u32(&desc->state, buf_state & (~BM_LOCKED)); } +extern uint32 WaitBufHdrUnlocked(BufferDesc *buf); + /* in bufmgr.c */ /* @@ -425,6 +427,8 @@ extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_co extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context, BufferTag *tag); +extern void TrackNewBufferPin(Buffer buf); + /* solely to make it easier to write tests */ extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait); extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits, diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index b5eebfb6990..0b841eb7838 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -518,7 +518,6 @@ static void PinBuffer_Locked(BufferDesc *buf); static void UnpinBuffer(BufferDesc *buf); static void UnpinBufferNoOwner(BufferDesc *buf); static void BufferSync(int flags); -static uint32 WaitBufHdrUnlocked(BufferDesc *buf); static int SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context); static void WaitIO(BufferDesc *buf); @@ -2325,7 +2324,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context) /* * Ensure, while the spinlock's not yet held, that there's a 
free refcount - * entry, and a resource owner slot for the pin. + * entry, and a resource owner slot for the pin. FIXME: Need to be updated */ ReservePrivateRefCountEntry(); ResourceOwnerEnlarge(CurrentResourceOwner); @@ -2334,17 +2333,11 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context) again: /* - * Select a victim buffer. The buffer is returned with its header - * spinlock still held! + * Select a victim buffer. The buffer is returned pinned by this backend. */ buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring); buf = BufferDescriptorGetBuffer(buf_hdr); - Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0); - - /* Pin the buffer and then release the buffer spinlock */ - PinBuffer_Locked(buf_hdr); - /* * We shouldn't have any other pins for this buffer. */ @@ -3043,8 +3036,6 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, uint32 buf_state; uint32 old_buf_state; - ref = NewPrivateRefCountEntry(b); - old_buf_state = pg_atomic_read_u32(&buf->state); for (;;) { @@ -3091,6 +3082,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, break; } } + + TrackNewBufferPin(b); } else { @@ -3110,11 +3103,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, * cannot meddle with that. */ result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0; + + Assert(ref->refcount > 0); + ref->refcount++; + ResourceOwnerRememberBuffer(CurrentResourceOwner, b); } - ref->refcount++; - Assert(ref->refcount > 0); - ResourceOwnerRememberBuffer(CurrentResourceOwner, b); return result; } @@ -3143,8 +3137,6 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, static void PinBuffer_Locked(BufferDesc *buf) { - Buffer b; - PrivateRefCountEntry *ref; uint32 buf_state; /* @@ -3169,12 +3161,7 @@ PinBuffer_Locked(BufferDesc *buf) buf_state += BUF_REFCOUNT_ONE; UnlockBufHdr(buf, buf_state); - b = BufferDescriptorGetBuffer(buf); - - ref = NewPrivateRefCountEntry(b); - ref->refcount++; - - ResourceOwnerRememberBuffer(CurrentResourceOwner, b); + TrackNewBufferPin(BufferDescriptorGetBuffer(buf)); } /* @@ -3289,6 +3276,17 @@ UnpinBufferNoOwner(BufferDesc *buf) } } +inline void +TrackNewBufferPin(Buffer buf) +{ + PrivateRefCountEntry *ref; + + ref = NewPrivateRefCountEntry(buf); + ref->refcount++; + + ResourceOwnerRememberBuffer(CurrentResourceOwner, buf); +} + #define ST_SORT sort_checkpoint_bufferids #define ST_ELEMENT_TYPE CkptSortItem #define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b) @@ -6242,7 +6240,7 @@ LockBufHdr(BufferDesc *desc) * Obviously the buffer could be locked by the time the value is returned, so * this is primarily useful in CAS style loops. */ -static uint32 +pg_noinline uint32 WaitBufHdrUnlocked(BufferDesc *buf) { SpinDelayStatus delayStatus; diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 7d59a92bd1a..9d9fb0471a0 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -164,8 +164,8 @@ ClockSweepTick(void) * * strategy is a BufferAccessStrategy object, or NULL for default strategy. * - * To ensure that no one else can pin the buffer before we do, we must - * return the buffer with the buffer header spinlock still held. + * The buffer is pinned before returning, but resource ownership is the + * responsibility of the caller. 
*/ BufferDesc * StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_ring) @@ -173,7 +173,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r BufferDesc *buf; int bgwprocno; int trycounter; - uint32 local_buf_state; /* to avoid repeated (de-)referencing */ *from_ring = false; @@ -228,44 +227,74 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r trycounter = NBuffers; for (;;) { + uint32 old_buf_state; + uint32 local_buf_state; + buf = GetBufferDescriptor(ClockSweepTick()); /* * If the buffer is pinned or has a nonzero usage_count, we cannot use * it; decrement the usage_count (unless pinned) and keep scanning. */ - local_buf_state = LockBufHdr(buf); - if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0) + old_buf_state = pg_atomic_read_u32(&buf->state); + + for (;;) { + local_buf_state = old_buf_state; + + if (BUF_STATE_GET_REFCOUNT(local_buf_state) != 0) + { + if (--trycounter == 0) + { + /* + * We've scanned all the buffers without making any state + * changes, so all the buffers are pinned (or were when we + * looked at them). We could hope that someone will free + * one eventually, but it's probably better to fail than + * to risk getting stuck in an infinite loop. + */ + elog(ERROR, "no unpinned buffers available"); + } + break; + } + + if (unlikely(local_buf_state & BM_LOCKED)) + { + old_buf_state = WaitBufHdrUnlocked(buf); + continue; + } + if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0) { local_buf_state -= BUF_USAGECOUNT_ONE; - trycounter = NBuffers; + if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, + local_buf_state)) + { + trycounter = NBuffers; + break; + } } else { - /* Found a usable buffer */ - if (strategy != NULL) - AddBufferToRing(strategy, buf); - *buf_state = local_buf_state; - return buf; + local_buf_state += BUF_REFCOUNT_ONE; + + if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, + local_buf_state)) + { + /* Found a usable buffer */ + if (strategy != NULL) + AddBufferToRing(strategy, buf); + *buf_state = local_buf_state; + + TrackNewBufferPin(BufferDescriptorGetBuffer(buf)); + + return buf; + } } + } - else if (--trycounter == 0) - { - /* - * We've scanned all the buffers without making any state changes, - * so all the buffers are pinned (or were when we looked at them). - * We could hope that someone will free one eventually, but it's - * probably better to fail than to risk getting stuck in an - * infinite loop. - */ - UnlockBufHdr(buf, local_buf_state); - elog(ERROR, "no unpinned buffers available"); - } - UnlockBufHdr(buf, local_buf_state); } } @@ -621,6 +650,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state) { BufferDesc *buf; Buffer bufnum; + uint32 old_buf_state; uint32 local_buf_state; /* to avoid repeated (de-)referencing */ @@ -647,14 +677,34 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state) * shouldn't re-use it. 
*/ buf = GetBufferDescriptor(bufnum - 1); - local_buf_state = LockBufHdr(buf); - if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0 - && BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1) + + old_buf_state = pg_atomic_read_u32(&buf->state); + + for (;;) { - *buf_state = local_buf_state; - return buf; + local_buf_state = old_buf_state; + + if (BUF_STATE_GET_REFCOUNT(local_buf_state) != 0 + || BUF_STATE_GET_USAGECOUNT(local_buf_state) > 1) + break; + + if (unlikely(local_buf_state & BM_LOCKED)) + { + old_buf_state = WaitBufHdrUnlocked(buf); + continue; + } + + local_buf_state += BUF_REFCOUNT_ONE; + + if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, + local_buf_state)) + { + *buf_state = local_buf_state; + + TrackNewBufferPin(BufferDescriptorGetBuffer(buf)); + return buf; + } } - UnlockBufHdr(buf, local_buf_state); /* * Tell caller to allocate a new buffer with the normal allocation -- 2.48.1.76.g4e746b1a31.dirty
>From 27d90068f2e4841434f35e3f5ecb6fe7dceeca50 Mon Sep 17 00:00:00 2001 From: Andres Freund <[email protected]> Date: Mon, 15 Sep 2025 17:06:48 -0400 Subject: [PATCH v4 3/6] bufmgr: Allow some buffer state modifications while holding header lock Until now BufferDesc.state was not allowed to be modified while the buffer header spinlock was held. This meant that operations like unpinning buffers needed to use a CAS loop, waiting for the buffer header spinlock to be released before updating. The benefit of that restriction is that it allowed us to unlock the buffer header spinlock with just a write barrier and an unlocked write (instead of a full atomic operation). That was important to avoid regressions in 48354581a49c. However, since then the hottest buffer header spinlock uses have been replaced with atomic operations (in particular, the most common use of PinBuffer_Locked(), in GetVictimBuffer() (formerly in BufferAlloc()), has been removed in FIXME-TODO-FIXME). This change will allow, in a subsequent commit, to release buffer pins with a single atomic-sub operation. This previously was not possible while such operations were not allowed while the buffer header spinlock was held, as an atomic-sub would not have allowed a race-free check for the buffer header lock being held. Using atomic-sub to unpin buffers is a nice scalability win, however it is not the primary motivation for this change (although it would be sufficient). The primary motivation is that we would like to merge the buffer content lock into BufferDesc.state, which will result in more frequent changes of the state variable, which in some situations can cause a performance regression, due to an increased CAS failure rate when unpinning buffers. The regression entirely vanishes when using atomic-sub. Naively implementing this would require putting CAS loops in every place modifying the buffer state while holding the buffer header lock. To avoid that, introduce UnlockBufHdrExt(), which can set/add flags as well as the refcount, together with releasing the lock. Reviewed-by: Robert Haas <[email protected]> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff --- src/include/storage/buf_internals.h | 119 +++++++--- src/backend/storage/buffer/bufmgr.c | 203 ++++++++++-------- src/backend/storage/buffer/freelist.c | 2 + contrib/pg_buffercache/pg_buffercache_pages.c | 7 +- contrib/pg_prewarm/autoprewarm.c | 2 +- src/test/modules/test_aio/test_aio.c | 10 +- 6 files changed, 224 insertions(+), 119 deletions(-) diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index c1206a46aba..e25b1ece3f9 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -211,28 +211,36 @@ BufMappingPartitionLockByIndex(uint32 index) /* * BufferDesc -- shared descriptor/state data for a single shared buffer. * - * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change - * tag, state or wait_backend_pgprocno fields. In general, buffer header lock - * is a spinlock which is combined with flags, refcount and usagecount into - * single atomic variable. This layout allow us to do some operations in a - * single atomic operation, without actually acquiring and releasing spinlock; - * for instance, increase or decrease refcount. buf_id field never changes - * after initialization, so does not need locking. The LWLock can take care - * of itself. The buffer header lock is *not* used to control access to the - * data in the buffer! 
+ * The state of the buffer is controlled by the, drumroll, state variable. It + * only may be modified using atomic operations. The state variable combines + * various flags, the buffer's refcount and usage count. See comment above + * BUF_REFCOUNT_BITS for details about the division. This layout allow us to + * do some operations in a single atomic operation, without actually acquiring + * and releasing the spinlock; for instance, increasing or decreasing the + * refcount. * - * It's assumed that nobody changes the state field while buffer header lock - * is held. Thus buffer header lock holder can do complex updates of the - * state variable in single write, simultaneously with lock release (cleaning - * BM_LOCKED flag). On the other hand, updating of state without holding - * buffer header lock is restricted to CAS, which ensures that BM_LOCKED flag - * is not set. Atomic increment/decrement, OR/AND etc. are not allowed. + * One of the aforementioned flags is BM_LOCKED, used to implement the buffer + * header lock. See the following paragraphs, as well as the documentation for + * individual fields for more details. * - * An exception is that if we have the buffer pinned, its tag can't change - * underneath us, so we can examine the tag without locking the buffer header. - * Also, in places we do one-time reads of the flags without bothering to - * lock the buffer header; this is generally for situations where we don't - * expect the flag bit being tested to be changing. + * The identity of the buffer (BufferDesc.tag) can only be changed by the + * backend holding the buffer header lock. + * + * If the lock is held by another backend, neither additional buffer pins may + * be established (we would like to relax this eventually), nor can flags be + * set/cleared. These operations either need to acquire the buffer header + * spinlock, or need to use a CAS loop, waiting for the lock to be released if + * it is held. However, existing buffer pins may be released while the buffer + * header spinlock is held, using an atomic subtraction. + * + * The LWLock can take care of itself. The buffer header lock is *not* used + * to control access to the data in the buffer! + * + * If we have the buffer pinned, its tag can't change underneath us, so we can + * examine the tag without locking the buffer header. Also, in places we do + * one-time reads of the flags without bothering to lock the buffer header; + * this is generally for situations where we don't expect the flag bit being + * tested to be changing. * * We can't physically remove items from a disk page if another backend has * the buffer pinned. Hence, a backend may need to wait for all other pins @@ -256,13 +264,29 @@ BufMappingPartitionLockByIndex(uint32 index) */ typedef struct BufferDesc { - BufferTag tag; /* ID of page contained in buffer */ - int buf_id; /* buffer's index number (from 0) */ + /* + * ID of page contained in buffer. The buffer header spinlock needs to be + * held to modify this field. + */ + BufferTag tag; - /* state of the tag, containing flags, refcount and usagecount */ + /* + * Buffer's index number (from 0). The field never changes after + * initialization, so does not need locking. + */ + int buf_id; + + /* + * State of the buffer, containing flags, refcount and usagecount. See + * BUF_* and BM_* defines at the top of this file. + */ pg_atomic_uint32 state; - int wait_backend_pgprocno; /* backend of pin-count waiter */ + /* + * Backend of pin-count waiter. 
The buffer header spinlock needs to be + * held to modify this field. + */ + int wait_backend_pgprocno; PgAioWaitRef io_wref; /* set iff AIO is in progress */ LWLock content_lock; /* to lock access to buffer contents */ @@ -364,11 +388,52 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc) */ extern uint32 LockBufHdr(BufferDesc *desc); +/* + * Unlock the buffer header. + * + * This can only be used if the caller did not modify BufferDesc.state. To + * set/unset flag bits or change the refcount use UnlockBufHdrExt(). + */ static inline void -UnlockBufHdr(BufferDesc *desc, uint32 buf_state) +UnlockBufHdr(BufferDesc *desc) { - pg_write_barrier(); - pg_atomic_write_u32(&desc->state, buf_state & (~BM_LOCKED)); + Assert(pg_atomic_read_u32(&desc->state) & BM_LOCKED); + + pg_atomic_fetch_sub_u32(&desc->state, BM_LOCKED); +} + +/* + * Unlock the buffer header, while atomically adding the flags in set_bits, + * unsetting the ones in unset_bits and changing the refcount by + * refcount_change. + * + * Note that this approach would not work for usagecount, since we need to cap + * the usagecount at BM_MAX_USAGE_COUNT. + */ +static inline uint32 +UnlockBufHdrExt(BufferDesc *desc, uint32 old_buf_state, + uint32 set_bits, uint32 unset_bits, + int refcount_change) +{ + for (;;) + { + uint32 buf_state = old_buf_state; + + Assert(buf_state & BM_LOCKED); + + buf_state |= set_bits; + buf_state &= ~unset_bits; + buf_state &= ~BM_LOCKED; + + if (refcount_change != 0) + buf_state += BUF_REFCOUNT_ONE * refcount_change; + + if (pg_atomic_compare_exchange_u32(&desc->state, &old_buf_state, + buf_state)) + { + return old_buf_state; + } + } } extern uint32 WaitBufHdrUnlocked(BufferDesc *buf); diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0b841eb7838..0ce4a736b58 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1994,6 +1994,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, Buffer victim_buffer; BufferDesc *victim_buf_hdr; uint32 victim_buf_state; + uint32 set_bits = 0; /* Make sure we will have room to remember the buffer pin */ ResourceOwnerEnlarge(CurrentResourceOwner); @@ -2120,11 +2121,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, * checkpoints, except for their "init" forks, which need to be treated * just like permanent relations. */ - victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE; + set_bits |= BM_TAG_VALID | BUF_USAGECOUNT_ONE; if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM) - victim_buf_state |= BM_PERMANENT; + set_bits |= BM_PERMANENT; - UnlockBufHdr(victim_buf_hdr, victim_buf_state); + UnlockBufHdrExt(victim_buf_hdr, victim_buf_state, + set_bits, 0, 0); LWLockRelease(newPartitionLock); @@ -2164,9 +2166,7 @@ InvalidateBuffer(BufferDesc *buf) /* Save the original buffer tag before dropping the spinlock */ oldTag = buf->tag; - buf_state = pg_atomic_read_u32(&buf->state); - Assert(buf_state & BM_LOCKED); - UnlockBufHdr(buf, buf_state); + UnlockBufHdr(buf); /* * Need to compute the old tag's hashcode and partition lock ID. 
XXX is it @@ -2190,7 +2190,7 @@ retry: /* If it's changed while we were waiting for lock, do nothing */ if (!BufferTagsEqual(&buf->tag, &oldTag)) { - UnlockBufHdr(buf, buf_state); + UnlockBufHdr(buf); LWLockRelease(oldPartitionLock); return; } @@ -2207,7 +2207,7 @@ retry: */ if (BUF_STATE_GET_REFCOUNT(buf_state) != 0) { - UnlockBufHdr(buf, buf_state); + UnlockBufHdr(buf); LWLockRelease(oldPartitionLock); /* safety check: should definitely not be our *own* pin */ if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0) @@ -2222,8 +2222,11 @@ retry: */ oldFlags = buf_state & BUF_FLAG_MASK; ClearBufferTag(&buf->tag); - buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK); - UnlockBufHdr(buf, buf_state); + + UnlockBufHdrExt(buf, buf_state, + 0, + BUF_FLAG_MASK | BUF_USAGECOUNT_MASK, + 0); /* * Remove the buffer from the lookup hashtable, if it was in there. @@ -2283,7 +2286,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr) { Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0); - UnlockBufHdr(buf_hdr, buf_state); + UnlockBufHdr(buf_hdr); LWLockRelease(partition_lock); return false; @@ -2297,8 +2300,10 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr) * tag (see e.g. FlushDatabaseBuffers()). */ ClearBufferTag(&buf_hdr->tag); - buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK); - UnlockBufHdr(buf_hdr, buf_state); + UnlockBufHdrExt(buf_hdr, buf_state, + 0, + BUF_FLAG_MASK | BUF_USAGECOUNT_MASK, + 0); Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0); @@ -2307,6 +2312,7 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr) LWLockRelease(partition_lock); + buf_state = pg_atomic_read_u32(&buf_hdr->state); Assert(!(buf_state & (BM_DIRTY | BM_VALID | BM_TAG_VALID))); Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0); Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf_hdr->state)) > 0); @@ -2396,7 +2402,7 @@ again: /* Read the LSN while holding buffer header lock */ buf_state = LockBufHdr(buf_hdr); lsn = BufferGetLSN(buf_hdr); - UnlockBufHdr(buf_hdr, buf_state); + UnlockBufHdr(buf_hdr); if (XLogNeedsFlush(lsn) && StrategyRejectBuffer(strategy, buf_hdr, from_ring)) @@ -2741,15 +2747,13 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, */ do { - uint32 buf_state = LockBufHdr(existing_hdr); - - buf_state &= ~BM_VALID; - UnlockBufHdr(existing_hdr, buf_state); + pg_atomic_fetch_and_u32(&existing_hdr->state, ~BM_VALID); } while (!StartBufferIO(existing_hdr, true, false)); } else { uint32 buf_state; + uint32 set_bits = 0; buf_state = LockBufHdr(victim_buf_hdr); @@ -2759,11 +2763,13 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, victim_buf_hdr->tag = tag; - buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE; + set_bits |= BM_TAG_VALID | BUF_USAGECOUNT_ONE; if (bmr.relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM) - buf_state |= BM_PERMANENT; + set_bits |= BM_PERMANENT; - UnlockBufHdr(victim_buf_hdr, buf_state); + UnlockBufHdrExt(victim_buf_hdr, buf_state, + set_bits, 0, + 0); LWLockRelease(partition_lock); @@ -2918,6 +2924,10 @@ MarkBufferDirty(Buffer buffer) Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE)); + /* + * NB: We have to wait for the buffer header spinlock to be not held, as + * TerminateBufferIO() relies on the spinlock. + */ old_buf_state = pg_atomic_read_u32(&bufHdr->state); for (;;) { @@ -3039,6 +3049,10 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, old_buf_state = pg_atomic_read_u32(&buf->state); for (;;) { + /* + * We're not allowed to increase the refcount while the buffer + * header spinlock is held. Wait for the lock to be released. 
+ */ if (old_buf_state & BM_LOCKED) old_buf_state = WaitBufHdrUnlocked(buf); @@ -3137,7 +3151,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy, static void PinBuffer_Locked(BufferDesc *buf) { - uint32 buf_state; + uint32 old_buf_state; /* * As explained, We don't expect any preexisting pins. That allows us to @@ -3156,10 +3170,10 @@ PinBuffer_Locked(BufferDesc *buf) * Since we hold the buffer spinlock, we can update the buffer state and * release the lock in one operation. */ - buf_state = pg_atomic_read_u32(&buf->state); - Assert(buf_state & BM_LOCKED); - buf_state += BUF_REFCOUNT_ONE; - UnlockBufHdr(buf, buf_state); + old_buf_state = pg_atomic_read_u32(&buf->state); + + UnlockBufHdrExt(buf, old_buf_state, + 0, 0, 1); TrackNewBufferPin(BufferDescriptorGetBuffer(buf)); } @@ -3194,12 +3208,13 @@ WakePinCountWaiter(BufferDesc *buf) /* we just released the last pin other than the waiter's */ int wait_backend_pgprocno = buf->wait_backend_pgprocno; - buf_state &= ~BM_PIN_COUNT_WAITER; - UnlockBufHdr(buf, buf_state); + UnlockBufHdrExt(buf, buf_state, + 0, BM_PIN_COUNT_WAITER, + 0); ProcSendSignal(wait_backend_pgprocno); } else - UnlockBufHdr(buf, buf_state); + UnlockBufHdr(buf); } /* @@ -3252,6 +3267,10 @@ UnpinBufferNoOwner(BufferDesc *buf) * * Since buffer spinlock holder can update status using just write, * it's not safe to use atomic decrement here; thus use a CAS loop. + * + * TODO: The above requirement does not hold anymore, in a future + * commit this will be rewritten to release the pin in a single atomic + * operation. */ old_buf_state = pg_atomic_read_u32(&buf->state); for (;;) @@ -3349,6 +3368,7 @@ BufferSync(int flags) for (buf_id = 0; buf_id < NBuffers; buf_id++) { BufferDesc *bufHdr = GetBufferDescriptor(buf_id); + uint32 set_bits = 0; /* * Header spinlock is enough to examine BM_DIRTY, see comment in @@ -3360,7 +3380,7 @@ BufferSync(int flags) { CkptSortItem *item; - buf_state |= BM_CHECKPOINT_NEEDED; + set_bits = BM_CHECKPOINT_NEEDED; item = &CkptBufferIds[num_to_scan++]; item->buf_id = buf_id; @@ -3370,7 +3390,9 @@ BufferSync(int flags) item->blockNum = bufHdr->tag.blockNum; } - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdrExt(bufHdr, buf_state, + set_bits, 0, + 0); /* Check for barrier events in case NBuffers is large. */ if (ProcSignalBarrierPending) @@ -3909,14 +3931,14 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context) else if (skip_recently_used) { /* Caller told us not to write recently-used buffers */ - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); return result; } if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY)) { /* It's clean, so nothing to do */ - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); return result; } @@ -4288,8 +4310,9 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, recptr = BufferGetLSN(buf); /* To check if block content changes while flushing. - vadim 01/17/97 */ - buf_state &= ~BM_JUST_DIRTIED; - UnlockBufHdr(buf, buf_state); + UnlockBufHdrExt(buf, buf_state, + 0, BM_JUST_DIRTIED, + 0); /* * Force XLOG flush up to buffer's LSN. This implements the basic WAL @@ -4452,7 +4475,6 @@ BufferGetLSNAtomic(Buffer buffer) char *page = BufferGetPage(buffer); BufferDesc *bufHdr; XLogRecPtr lsn; - uint32 buf_state; /* * If we don't need locking for correctness, fastpath out. 
@@ -4465,9 +4487,9 @@ BufferGetLSNAtomic(Buffer buffer) Assert(BufferIsPinned(buffer)); bufHdr = GetBufferDescriptor(buffer - 1); - buf_state = LockBufHdr(bufHdr); + LockBufHdr(bufHdr); lsn = PageGetLSN(page); - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); return lsn; } @@ -4568,7 +4590,6 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, for (i = 0; i < NBuffers; i++) { BufferDesc *bufHdr = GetBufferDescriptor(i); - uint32 buf_state; /* * We can make this a tad faster by prechecking the buffer tag before @@ -4589,7 +4610,7 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator.locator)) continue; - buf_state = LockBufHdr(bufHdr); + LockBufHdr(bufHdr); for (j = 0; j < nforks; j++) { @@ -4602,7 +4623,7 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } if (j >= nforks) - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } } @@ -4731,7 +4752,6 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators) { RelFileLocator *rlocator = NULL; BufferDesc *bufHdr = GetBufferDescriptor(i); - uint32 buf_state; /* * As in DropRelationBuffers, an unlocked precheck should be safe and @@ -4765,11 +4785,11 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators) if (rlocator == NULL) continue; - buf_state = LockBufHdr(bufHdr); + LockBufHdr(bufHdr); if (BufTagMatchesRelFileLocator(&bufHdr->tag, rlocator)) InvalidateBuffer(bufHdr); /* releases spinlock */ else - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } pfree(locators); @@ -4799,7 +4819,6 @@ FindAndDropRelationBuffers(RelFileLocator rlocator, ForkNumber forkNum, LWLock *bufPartitionLock; /* buffer partition lock for it */ int buf_id; BufferDesc *bufHdr; - uint32 buf_state; /* create a tag so we can lookup the buffer */ InitBufferTag(&bufTag, &rlocator, forkNum, curBlock); @@ -4824,14 +4843,14 @@ FindAndDropRelationBuffers(RelFileLocator rlocator, ForkNumber forkNum, * evicted by some other backend loading blocks for a different * relation after we release lock on the BufMapping table. 
*/ - buf_state = LockBufHdr(bufHdr); + LockBufHdr(bufHdr); if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator) && BufTagGetForkNum(&bufHdr->tag) == forkNum && bufHdr->tag.blockNum >= firstDelBlock) InvalidateBuffer(bufHdr); /* releases spinlock */ else - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } } @@ -4859,7 +4878,6 @@ DropDatabaseBuffers(Oid dbid) for (i = 0; i < NBuffers; i++) { BufferDesc *bufHdr = GetBufferDescriptor(i); - uint32 buf_state; /* * As in DropRelationBuffers, an unlocked precheck should be safe and @@ -4868,11 +4886,11 @@ DropDatabaseBuffers(Oid dbid) if (bufHdr->tag.dbOid != dbid) continue; - buf_state = LockBufHdr(bufHdr); + LockBufHdr(bufHdr); if (bufHdr->tag.dbOid == dbid) InvalidateBuffer(bufHdr); /* releases spinlock */ else - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } } @@ -4971,7 +4989,7 @@ FlushRelationBuffers(Relation rel) UnpinBuffer(bufHdr); } else - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } } @@ -5068,7 +5086,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels) UnpinBuffer(bufHdr); } else - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } pfree(srels); @@ -5296,7 +5314,7 @@ FlushDatabaseBuffers(Oid dbid) UnpinBuffer(bufHdr); } else - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } } @@ -5506,8 +5524,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) PageSetLSN(page, lsn); } - buf_state |= BM_DIRTY | BM_JUST_DIRTIED; - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdrExt(bufHdr, buf_state, + BM_DIRTY | BM_JUST_DIRTIED, + 0, 0); if (delayChkptFlags) MyProc->delayChkptFlags &= ~DELAY_CHKPT_START; @@ -5538,6 +5557,7 @@ UnlockBuffers(void) if (buf) { uint32 buf_state; + uint32 unset_bits = 0; buf_state = LockBufHdr(buf); @@ -5547,9 +5567,11 @@ UnlockBuffers(void) */ if ((buf_state & BM_PIN_COUNT_WAITER) != 0 && buf->wait_backend_pgprocno == MyProcNumber) - buf_state &= ~BM_PIN_COUNT_WAITER; + unset_bits = BM_PIN_COUNT_WAITER; - UnlockBufHdr(buf, buf_state); + UnlockBufHdrExt(buf, buf_state, + 0, unset_bits, + 0); PinCountWaitBuf = NULL; } @@ -5667,6 +5689,7 @@ LockBufferForCleanup(Buffer buffer) for (;;) { uint32 buf_state; + uint32 unset_bits = 0; /* Try to acquire lock */ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); @@ -5676,7 +5699,7 @@ LockBufferForCleanup(Buffer buffer) if (BUF_STATE_GET_REFCOUNT(buf_state) == 1) { /* Successfully acquired exclusive lock with pincount 1 */ - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); /* * Emit the log message if recovery conflict on buffer pin was @@ -5699,14 +5722,15 @@ LockBufferForCleanup(Buffer buffer) /* Failed, so mark myself as waiting for pincount 1 */ if (buf_state & BM_PIN_COUNT_WAITER) { - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); LockBuffer(buffer, BUFFER_LOCK_UNLOCK); elog(ERROR, "multiple backends attempting to wait for pincount 1"); } bufHdr->wait_backend_pgprocno = MyProcNumber; PinCountWaitBuf = bufHdr; - buf_state |= BM_PIN_COUNT_WAITER; - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdrExt(bufHdr, buf_state, + BM_PIN_COUNT_WAITER, 0, + 0); LockBuffer(buffer, BUFFER_LOCK_UNLOCK); /* Wait to be signaled by UnpinBuffer() */ @@ -5768,8 +5792,11 @@ LockBufferForCleanup(Buffer buffer) buf_state = LockBufHdr(bufHdr); if ((buf_state & BM_PIN_COUNT_WAITER) != 0 && bufHdr->wait_backend_pgprocno == MyProcNumber) - buf_state &= ~BM_PIN_COUNT_WAITER; - UnlockBufHdr(bufHdr, buf_state); + unset_bits |= BM_PIN_COUNT_WAITER; + + UnlockBufHdrExt(bufHdr, buf_state, + 0, unset_bits, + 0); PinCountWaitBuf = NULL; /* 
Loop back and try again */ @@ -5846,12 +5873,12 @@ ConditionalLockBufferForCleanup(Buffer buffer) if (refcount == 1) { /* Successfully acquired exclusive lock with pincount 1 */ - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); return true; } /* Failed, so release the lock */ - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); LockBuffer(buffer, BUFFER_LOCK_UNLOCK); return false; } @@ -5899,11 +5926,11 @@ IsBufferCleanupOK(Buffer buffer) if (BUF_STATE_GET_REFCOUNT(buf_state) == 1) { /* pincount is OK. */ - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); return true; } - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); return false; } @@ -5941,7 +5968,7 @@ WaitIO(BufferDesc *buf) * clearing the wref while it's being read. */ iow = buf->io_wref; - UnlockBufHdr(buf, buf_state); + UnlockBufHdr(buf); /* no IO in progress, we don't need to wait */ if (!(buf_state & BM_IO_IN_PROGRESS)) @@ -6009,7 +6036,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait) if (!(buf_state & BM_IO_IN_PROGRESS)) break; - UnlockBufHdr(buf, buf_state); + UnlockBufHdr(buf); if (nowait) return false; WaitIO(buf); @@ -6020,12 +6047,13 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait) /* Check if someone else already did the I/O */ if (forInput ? (buf_state & BM_VALID) : !(buf_state & BM_DIRTY)) { - UnlockBufHdr(buf, buf_state); + UnlockBufHdr(buf); return false; } - buf_state |= BM_IO_IN_PROGRESS; - UnlockBufHdr(buf, buf_state); + UnlockBufHdrExt(buf, buf_state, + BM_IO_IN_PROGRESS, 0, + 0); ResourceOwnerRememberBufferIO(CurrentResourceOwner, BufferDescriptorGetBuffer(buf)); @@ -6058,28 +6086,31 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits, bool forget_owner, bool release_aio) { uint32 buf_state; + uint32 unset_flag_bits = 0; + int refcount_change = 0; buf_state = LockBufHdr(buf); Assert(buf_state & BM_IO_IN_PROGRESS); - buf_state &= ~BM_IO_IN_PROGRESS; + unset_flag_bits |= BM_IO_IN_PROGRESS; /* Clear earlier errors, if this IO failed, it'll be marked again */ - buf_state &= ~BM_IO_ERROR; + unset_flag_bits |= BM_IO_ERROR; if (clear_dirty && !(buf_state & BM_JUST_DIRTIED)) - buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED); + unset_flag_bits |= BM_DIRTY | BM_CHECKPOINT_NEEDED; if (release_aio) { /* release ownership by the AIO subsystem */ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0); - buf_state -= BUF_REFCOUNT_ONE; + refcount_change = -1; pgaio_wref_clear(&buf->io_wref); } - buf_state |= set_flag_bits; - UnlockBufHdr(buf, buf_state); + UnlockBufHdrExt(buf, buf_state, + set_flag_bits, unset_flag_bits, + refcount_change); if (forget_owner) ResourceOwnerForgetBufferIO(CurrentResourceOwner, @@ -6124,12 +6155,12 @@ AbortBufferIO(Buffer buffer) if (!(buf_state & BM_VALID)) { Assert(!(buf_state & BM_DIRTY)); - UnlockBufHdr(buf_hdr, buf_state); + UnlockBufHdr(buf_hdr); } else { Assert(buf_state & BM_DIRTY); - UnlockBufHdr(buf_hdr, buf_state); + UnlockBufHdr(buf_hdr); /* Issue notice if this is not the first failure... */ if (buf_state & BM_IO_ERROR) @@ -6551,14 +6582,14 @@ EvictUnpinnedBufferInternal(BufferDesc *desc, bool *buffer_flushed) if ((buf_state & BM_VALID) == 0) { - UnlockBufHdr(desc, buf_state); + UnlockBufHdr(desc); return false; } /* Check that it's not pinned already. 
*/ if (BUF_STATE_GET_REFCOUNT(buf_state) > 0) { - UnlockBufHdr(desc, buf_state); + UnlockBufHdr(desc); return false; } @@ -6710,7 +6741,7 @@ EvictRelUnpinnedBuffers(Relation rel, int32 *buffers_evicted, if ((buf_state & BM_VALID) == 0 || !BufTagMatchesRelFileLocator(&desc->tag, &rel->rd_locator)) { - UnlockBufHdr(desc, buf_state); + UnlockBufHdr(desc); continue; } @@ -6808,13 +6839,15 @@ buffer_stage_common(PgAioHandle *ioh, bool is_write, bool is_temp) * * This pin is released again in TerminateBufferIO(). */ - buf_state += BUF_REFCOUNT_ONE; buf_hdr->io_wref = io_ref; if (is_temp) + { + buf_state += BUF_REFCOUNT_ONE; pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state); + } else - UnlockBufHdr(buf_hdr, buf_state); + UnlockBufHdrExt(buf_hdr, buf_state, 0, 0, 1); /* * Ensure the content lock that prevents buffer modifications while diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 9d9fb0471a0..c98080a7d0b 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -259,6 +259,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r break; } + /* See equivalent code in PinBuffer() */ if (unlikely(local_buf_state & BM_LOCKED)) { old_buf_state = WaitBufHdrUnlocked(buf); @@ -688,6 +689,7 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state) || BUF_STATE_GET_USAGECOUNT(local_buf_state) > 1) break; + /* See equivalent code in PinBuffer() */ if (unlikely(local_buf_state & BM_LOCKED)) { old_buf_state = WaitBufHdrUnlocked(buf); diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 3df04c98959..ab790533ff6 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -220,7 +220,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) else fctx->record[i].isvalid = false; - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } } @@ -460,7 +460,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) { char *buffptr = (char *) BufferGetBlock(i + 1); BufferDesc *bufHdr; - uint32 buf_state; uint32 bufferid; int32 page_num; char *startptr_buff, @@ -471,9 +470,9 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS) bufHdr = GetBufferDescriptor(i); /* Lock each buffer header before inspecting. 
*/ - buf_state = LockBufHdr(bufHdr); + LockBufHdr(bufHdr); bufferid = BufferDescriptorGetBuffer(bufHdr); - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); /* start of the first page of this buffer */ startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr); diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index 8b68dafc261..5ba1240d51f 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -730,7 +730,7 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged) ++num_blocks; } - UnlockBufHdr(bufHdr, buf_state); + UnlockBufHdr(bufHdr); } snprintf(transient_dump_file_path, MAXPGPATH, "%s.tmp", AUTOPREWARM_FILE); diff --git a/src/test/modules/test_aio/test_aio.c b/src/test/modules/test_aio/test_aio.c index c55cf6c0aac..d7eadeab256 100644 --- a/src/test/modules/test_aio/test_aio.c +++ b/src/test/modules/test_aio/test_aio.c @@ -310,6 +310,7 @@ create_toy_buffer(Relation rel, BlockNumber blkno) BufferDesc *buf_hdr; uint32 buf_state; bool was_pinned = false; + uint32 unset_bits = 0; /* place buffer in shared buffers without erroring out */ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_ZERO_AND_LOCK, NULL); @@ -334,12 +335,17 @@ create_toy_buffer(Relation rel, BlockNumber blkno) if (BUF_STATE_GET_REFCOUNT(buf_state) > 1) was_pinned = true; else - buf_state &= ~(BM_VALID | BM_DIRTY); + unset_bits |= BM_VALID | BM_DIRTY; if (RelationUsesLocalBuffers(rel)) + { + buf_state &= ~unset_bits; pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state); + } else - UnlockBufHdr(buf_hdr, buf_state); + { + UnlockBufHdrExt(buf_hdr, buf_state, 0, unset_bits, 0); + } if (was_pinned) elog(ERROR, "toy buffer %d was already pinned", -- 2.48.1.76.g4e746b1a31.dirty
>From e763e0073f1b6c82b77be77fdd101ce3e54f650f Mon Sep 17 00:00:00 2001 From: Andres Freund <[email protected]> Date: Mon, 22 Sep 2025 14:44:52 -0400 Subject: [PATCH v4 4/6] bufmgr: Use atomic sub for unpinning buffers The prior commit made it legal to modify BufferDesc.state while the buffer header spinlock is held. This allows us to replace the CAS loop inUnpinBufferNoOwner() with an atomic sub. This improves scalability significantly. See the prior commits for more background. Reviewed-by: Robert Haas <[email protected]> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff --- src/backend/storage/buffer/bufmgr.c | 29 +++-------------------------- 1 file changed, 3 insertions(+), 26 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 0ce4a736b58..594a8e5d512 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3247,7 +3247,6 @@ UnpinBufferNoOwner(BufferDesc *buf) ref->refcount--; if (ref->refcount == 0) { - uint32 buf_state; uint32 old_buf_state; /* @@ -3262,33 +3261,11 @@ UnpinBufferNoOwner(BufferDesc *buf) /* I'd better not still hold the buffer content lock */ Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf))); - /* - * Decrement the shared reference count. - * - * Since buffer spinlock holder can update status using just write, - * it's not safe to use atomic decrement here; thus use a CAS loop. - * - * TODO: The above requirement does not hold anymore, in a future - * commit this will be rewritten to release the pin in a single atomic - * operation. - */ - old_buf_state = pg_atomic_read_u32(&buf->state); - for (;;) - { - if (old_buf_state & BM_LOCKED) - old_buf_state = WaitBufHdrUnlocked(buf); - - buf_state = old_buf_state; - - buf_state -= BUF_REFCOUNT_ONE; - - if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, - buf_state)) - break; - } + /* decrement the shared reference count */ + old_buf_state = pg_atomic_fetch_sub_u32(&buf->state, BUF_REFCOUNT_ONE); /* Support LockBufferForCleanup() */ - if (buf_state & BM_PIN_COUNT_WAITER) + if (old_buf_state & BM_PIN_COUNT_WAITER) WakePinCountWaiter(buf); ForgetPrivateRefCountEntry(ref); -- 2.48.1.76.g4e746b1a31.dirty
>From c044083cde67f216c12c1d6a481a8b4390a5882f Mon Sep 17 00:00:00 2001 From: Andres Freund <[email protected]> Date: Mon, 30 Jun 2025 16:54:39 -0400 Subject: [PATCH v4 5/6] bufmgr: fewer calls to BufferDescriptorGetContentLock We're planning to merge buffer content locks into BufferDesc.state. To reduce the size of that patch, centralize BufferDescriptorGetContentLock(). The biggest part of the change is in assertions, by introducing BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This seems like an improvement even without aforementioned plans. Additionally replace some direct calls to LWLockAcquire() with calls to LockBuffer(). Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/storage/bufmgr.h | 3 +- src/backend/access/heap/visibilitymap.c | 3 +- src/backend/access/transam/xloginsert.c | 3 +- src/backend/storage/buffer/bufmgr.c | 70 +++++++++++++++++++------ 4 files changed, 61 insertions(+), 18 deletions(-) diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index 47360a3d3d8..3f37b294af6 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -230,7 +230,8 @@ extern void WaitReadBuffers(ReadBuffersOperation *operation); extern void ReleaseBuffer(Buffer buffer); extern void UnlockReleaseBuffer(Buffer buffer); -extern bool BufferIsExclusiveLocked(Buffer buffer); +extern bool BufferIsLockedByMe(Buffer buffer); +extern bool BufferIsLockedByMeInMode(Buffer buffer, int mode); extern bool BufferIsDirty(Buffer buffer); extern void MarkBufferDirty(Buffer buffer); extern void IncrBufferRefCount(Buffer buffer); diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c index 7306c16f05c..0414ce1945c 100644 --- a/src/backend/access/heap/visibilitymap.c +++ b/src/backend/access/heap/visibilitymap.c @@ -270,7 +270,8 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf, if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk) elog(ERROR, "wrong heap buffer passed to visibilitymap_set"); - Assert(!BufferIsValid(heapBuf) || BufferIsExclusiveLocked(heapBuf)); + Assert(!BufferIsValid(heapBuf) || + BufferIsLockedByMeInMode(heapBuf, BUFFER_LOCK_EXCLUSIVE)); /* Check that we have the right VM page pinned */ if (!BufferIsValid(vmBuf) || BufferGetBlockNumber(vmBuf) != mapBlock) diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index c7571429e8e..496e0fa4ac6 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -258,7 +258,8 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags) */ #ifdef USE_ASSERT_CHECKING if (!(flags & REGBUF_NO_CHANGE)) - Assert(BufferIsExclusiveLocked(buffer) && BufferIsDirty(buffer)); + Assert(BufferIsLockedByMeInMode(buffer, BUFFER_LOCK_EXCLUSIVE) && + BufferIsDirty(buffer)); #endif if (block_id >= max_registered_block_id) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 594a8e5d512..350b35362ce 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1065,7 +1065,7 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid) * already valid.) 
*/ if (!isLocalBuf) - LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE); + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); /* Set BM_VALID, terminate IO, and wake up any waiters */ if (isLocalBuf) @@ -2822,7 +2822,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, } if (lock) - LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); TerminateBufferIO(buf_hdr, false, BM_VALID, true, false); } @@ -2835,14 +2835,14 @@ ExtendBufferedRelShared(BufferManagerRelation bmr, } /* - * BufferIsExclusiveLocked + * BufferIsLockedByMe * - * Checks if buffer is exclusive-locked. + * Checks if this backend has the buffer locked in any mode. * * Buffer must be pinned. */ bool -BufferIsExclusiveLocked(Buffer buffer) +BufferIsLockedByMe(Buffer buffer) { BufferDesc *bufHdr; @@ -2855,9 +2855,49 @@ BufferIsExclusiveLocked(Buffer buffer) } else { + bufHdr = GetBufferDescriptor(buffer - 1); + return LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)); + } +} + +/* + * BufferIsLockedByMeInMode + * + * Checks if this backend has the buffer locked in the specified mode. + * + * Buffer must be pinned. + */ +bool +BufferIsLockedByMeInMode(Buffer buffer, int mode) +{ + BufferDesc *bufHdr; + + Assert(BufferIsPinned(buffer)); + + if (BufferIsLocal(buffer)) + { + /* Content locks are not maintained for local buffers. */ + return true; + } + else + { + LWLockMode lw_mode; + + switch (mode) + { + case BUFFER_LOCK_EXCLUSIVE: + lw_mode = LW_EXCLUSIVE; + break; + case BUFFER_LOCK_SHARE: + lw_mode = LW_SHARED; + break; + default: + pg_unreachable(); + } + bufHdr = GetBufferDescriptor(buffer - 1); return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), - LW_EXCLUSIVE); + lw_mode); } } @@ -2886,8 +2926,7 @@ BufferIsDirty(Buffer buffer) else { bufHdr = GetBufferDescriptor(buffer - 1); - Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), - LW_EXCLUSIVE)); + Assert(BufferIsLockedByMeInMode(buffer, BUFFER_LOCK_EXCLUSIVE)); } return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY; @@ -2921,8 +2960,7 @@ MarkBufferDirty(Buffer buffer) bufHdr = GetBufferDescriptor(buffer - 1); Assert(BufferIsPinned(buffer)); - Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), - LW_EXCLUSIVE)); + Assert(BufferIsLockedByMeInMode(buffer, BUFFER_LOCK_EXCLUSIVE)); /* * NB: We have to wait for the buffer header spinlock to be not held, as @@ -3258,7 +3296,10 @@ UnpinBufferNoOwner(BufferDesc *buf) */ VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ); - /* I'd better not still hold the buffer content lock */ + /* + * I'd better not still hold the buffer content lock. Can't use + * BufferIsLockedByMe(), as that asserts the buffer is pinned. 
+ */ Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf))); /* decrement the shared reference count */ @@ -5311,7 +5352,7 @@ FlushOneBuffer(Buffer buffer) bufHdr = GetBufferDescriptor(buffer - 1); - Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr))); + Assert(BufferIsLockedByMe(buffer)); FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL); } @@ -5402,7 +5443,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) Assert(GetPrivateRefCount(buffer) > 0); /* here, either share or exclusive lock is OK */ - Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr))); + Assert(BufferIsLockedByMe(buffer)); /* * This routine might get called many times on the same page, if we are @@ -5894,8 +5935,7 @@ IsBufferCleanupOK(Buffer buffer) bufHdr = GetBufferDescriptor(buffer - 1); /* caller must hold exclusive lock on buffer */ - Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), - LW_EXCLUSIVE)); + Assert(BufferIsLockedByMeInMode(buffer, BUFFER_LOCK_EXCLUSIVE)); buf_state = LockBufHdr(bufHdr); -- 2.48.1.76.g4e746b1a31.dirty
>From 3a70f842d5b5dfd5e8e477bc9cea918efbb1b8ec Mon Sep 17 00:00:00 2001 From: Andres Freund <[email protected]> Date: Mon, 30 Jun 2025 14:30:38 -0400 Subject: [PATCH v4 6/6] bufmgr: Introduce FlushUnlockedBuffer There were several copies of code locking a buffer, flushing its contents, and unlocking the buffer. It seems worth centralizing that into a helper function. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/backend/storage/buffer/bufmgr.c | 36 ++++++++++++++++------------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 350b35362ce..ff961ec46d9 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -533,6 +533,8 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr, static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress); static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete); static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context); +static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln, + IOObject io_object, IOContext io_context); static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, IOContext io_context); static void FindAndDropRelationBuffers(RelFileLocator rlocator, @@ -3965,11 +3967,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context) * buffer is clean by the time we've locked it.) */ PinBuffer_Locked(bufHdr); - LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); - FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL); - - LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL); tag = bufHdr->tag; @@ -4417,6 +4416,19 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object, error_context_stack = errcallback.previous; } +/* + * Convenience wrapper around FlushBuffer() that locks/unlocks the buffer + * before/after calling FlushBuffer(). + */ +static void +FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln, + IOObject io_object, IOContext io_context) +{ + LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED); + FlushBuffer(buf, reln, IOOBJECT_RELATION, IOCONTEXT_NORMAL); + LWLockRelease(BufferDescriptorGetContentLock(buf)); +} + /* * RelationGetNumberOfBlocksInFork * Determines the current number of pages in the specified relation fork. 
@@ -5001,9 +5013,7 @@ FlushRelationBuffers(Relation rel) (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) { PinBuffer_Locked(bufHdr); - LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); - FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL); - LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + FlushUnlockedBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL); UnpinBuffer(bufHdr); } else @@ -5098,9 +5108,7 @@ FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels) (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) { PinBuffer_Locked(bufHdr); - LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); - FlushBuffer(bufHdr, srelent->srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL); - LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + FlushUnlockedBuffer(bufHdr, srelent->srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL); UnpinBuffer(bufHdr); } else @@ -5326,9 +5334,7 @@ FlushDatabaseBuffers(Oid dbid) (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) { PinBuffer_Locked(bufHdr); - LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); - FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL); - LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL); UnpinBuffer(bufHdr); } else @@ -6615,10 +6621,8 @@ EvictUnpinnedBufferInternal(BufferDesc *desc, bool *buffer_flushed) /* If it was dirty, try to clean it once. */ if (buf_state & BM_DIRTY) { - LWLockAcquire(BufferDescriptorGetContentLock(desc), LW_SHARED); - FlushBuffer(desc, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL); + FlushUnlockedBuffer(desc, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL); *buffer_flushed = true; - LWLockRelease(BufferDescriptorGetContentLock(desc)); } /* This will return false if it becomes dirty or someone else pins it. */ -- 2.48.1.76.g4e746b1a31.dirty
