Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <[email protected]>
> wrote in
> > Hello, all.
> >
> > I thought about patch simplification, and tested a version
> > without the BufTable and dynahash API changes at all.
> >
> > It performs surprisingly well. It is just a bit worse
> > than v1 since there is more contention around dynahash's
> > freelist, but most of the improvement remains.
> >
> > I'll finish benchmarking and will attach graphs with the
> > next message. The patch is attached here.
>
> Thanks for the new patch. The patch as a whole looks fine to me. But
> some comments need to be revised.
Thank you for the review and remarks.
>
> (existing comments)
> > * To change the association of a valid buffer, we'll need to have
> > * exclusive lock on both the old and new mapping partitions.
> ...
> > * Somebody could have pinned or re-dirtied the buffer while we were
> > * doing the I/O and making the new hashtable entry. If so, we can't
> > * recycle this buffer; we must undo everything we've done and start
> > * over with a new victim buffer.
>
> We no longer take a lock on the new partition and have no new hash
> entry (if others have not yet done) at this point.
fixed
> + * Clear out the buffer's tag and flags. We must do this to ensure that
> + * linear scans of the buffer array don't think the buffer is valid. We
>
> The reason we can clear out the tag is it's safe to use the victim
> buffer at this point. This comment needs to mention that reason.
Tried to describe.
> + *
> + * Since we are the single pinner, there should be no PIN_COUNT_WAITER or
> + * IO_IN_PROGRESS (flags that were not cleared in previous code).
> + */
> + Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
>
> It seems to be a test for potential bugs in other functions. As
> the comment is saying, we are sure that no other processes are pinning
> the buffer, and the existing code doesn't seem to care about that
> condition. Is it really needed?
OK, I agree this check is excessive.
These two flags were not cleared in the previous code, and I didn't get
why. Probably it is just a historical accident.
>
> + /*
> + * Try to make a hashtable entry for the buffer under its new tag. This
> + * could fail because while we were writing someone else allocated another
>
> The most significant point of this patch is the reason that the victim
> buffer is protected from stealing until it is set up for new tag. I
> think we need an explanation about the protection here.
I don't clearly get what you mean :( I would appreciate your
suggestion for this comment.
>
>
> + * buffer for the same block we want to read in. Note that we have not yet
> + * removed the hashtable entry for the old tag.
>
> Since we have removed the hash table entry for the old tag at this
> point, the comment got wrong.
Thanks, changed.
> + * the first place. First, give up the buffer we were planning to use
> + * and put it to free lists.
> ..
> + StrategyFreeBuffer(buf);
>
> This is one downside of this patch. But it seems to me that the odds
> are low that many buffers are freed in a short time by this logic. By
> the way, it would be better if the sentence starting with "First" had
> a separate comment section.
Split the comment.
> (existing comment)
> | * Okay, it's finally safe to rename the buffer.
>
> We don't "rename" the buffer here. And the safety is already
> established at the end of the oldPartitionLock section. So it would
> be just something like "Now allocate the victim buffer for the new
> tag"?
Changed to "Now it is safe to use victim buffer for new tag."
There is also a tiny code change at block reuse finalization: instead
of a LockBufHdr+UnlockBufHdr pair I use a single atomic_fetch_or
protected with WaitBufHdrUnlocked. I've tried to explain its safety.
Please check it.
Benchmarks:
- base point is 6ce16088bfed97f9.
- notebook with i7-1165G7 and server with Xeon 8354H (1&2 sockets)
- pgbench select only scale 100 (1.5GB on disk)
- two shared_buffers values: 128MB and 1GB.
- enabled hugepages
- second best result from five runs
Notebook:
conns | master | patch_v3 | master 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 29508 | 29481 | 31774 | 32305
2 | 57139 | 56694 | 63393 | 62968
3 | 89759 | 90861 | 101873 | 102399
5 | 133491 | 134573 | 145451 | 145009
7 | 156739 | 155832 | 164633 | 164562
17 | 216863 | 216379 | 251923 | 251017
27 | 209532 | 209802 | 244202 | 243709
53 | 212615 | 213552 | 248107 | 250317
83 | 214446 | 218230 | 252414 | 252337
107 | 211276 | 217109 | 252762 | 250328
139 | 208070 | 214265 | 248350 | 249684
163 | 206764 | 214594 | 247369 | 250323
191 | 205478 | 213511 | 244597 | 246877
211 | 200976 | 212976 | 244035 | 245032
239 | 196588 | 211519 | 243897 | 245055
271 | 195813 | 209631 | 237457 | 242771
307 | 192724 | 208074 | 237658 | 241759
353 | 187847 | 207189 | 234548 | 239008
397 | 186942 | 205317 | 230465 | 238782
I don't get why the numbers changed since the first message ))
But still there is no slowdown, and a measurable gain at 128MB
shared buffers.
Xeon 1 socket
conns | master | patch_v3 | master 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 41975 | 41799 | 52898 | 52715
2 | 77693 | 77531 | 97571 | 98547
3 | 114713 | 114533 | 142709 | 143579
5 | 188898 | 187241 | 239322 | 236682
7 | 261516 | 260249 | 329119 | 328562
17 | 521821 | 518981 | 672390 | 662987
27 | 555487 | 557019 | 674630 | 675703
53 | 868213 | 897097 | 1190734 | 1202575
83 | 868232 | 881705 | 1164997 | 1157764
107 | 850477 | 855169 | 1140597 | 1128027
139 | 816311 | 826756 | 1101471 | 1096197
163 | 794788 | 805946 | 1078445 | 1071535
191 | 765934 | 783209 | 1059497 | 1039936
211 | 738656 | 786171 | 1083356 | 1049025
239 | 713124 | 837040 | 1104629 | 1125969
271 | 692138 | 847741 | 1094432 | 1131968
307 | 682919 | 847939 | 1086306 | 1124649
353 | 679449 | 844596 | 1071482 | 1125980
397 | 676217 | 833009 | 1058937 | 1113496
There is a small slowdown at some connection counts (17, 107-191).
It is reproducible. Probably it is due to one more atomic write, or
perhaps to some other scheduling issue (processes block less on the
buffer manager but compete more on other resources). I could not
reliably determine why, because the change is too small, and
`perf record` harms performance too much at this point.
This is the reason I changed finalization to an atomic_or instead
of the Lock+Unlock pair. The change helped a bit, but didn't remove
the slowdown completely.
Xeon 2 socket
conns | m0 | patch_v3 | m0 1G | patch_v3 1G
--------+------------+------------+------------+------------
1 | 44317 | 43747 | 53920 | 53759
2 | 81193 | 79976 | 99138 | 99213
3 | 120755 | 114481 | 148102 | 146494
5 | 190007 | 187384 | 232078 | 229627
7 | 258602 | 256657 | 325545 | 322417
17 | 551814 | 549041 | 692312 | 688204
27 | 787353 | 787916 | 1023509 | 1020995
53 | 973880 | 996019 | 1228274 | 1246128
83 | 1108442 | 1258589 | 1596292 | 1662586
107 | 1072188 | 1317229 | 1542401 | 1684603
139 | 1000446 | 1272759 | 1490757 | 1672507
163 | 967378 | 1224513 | 1461468 | 1660012
191 | 926010 | 1178067 | 1435317 | 1645886
211 | 909919 | 1148862 | 1417437 | 1629487
239 | 895944 | 1108579 | 1393530 | 1616824
271 | 880545 | 1078280 | 1374878 | 1608412
307 | 865560 | 1056988 | 1355164 | 1601066
353 | 857591 | 1033980 | 1330069 | 1586769
397 | 840374 | 1016690 | 1312257 | 1573376
regards,
Yura Sokolov
Postgres Professional
[email protected]
[email protected]
From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <[email protected]>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned,
so other processes cannot select it for eviction. If the tag is cleared
and the buffer is removed from the old partition, other processes will
not find it. Therefore it is safe to release the old partition lock
before acquiring the new partition lock.

Tags: lwlock_numa
---
 src/backend/storage/buffer/bufmgr.c | 206 ++++++++++++++--------------
 1 file changed, 105 insertions(+), 101 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..bb8b1cd2f4b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1114,6 +1114,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *buf;
 	bool		valid;
 	uint32		buf_state;
+	uint32		new_bits;
 
 	/* create a tag so we can lookup the buffer */
 	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -1288,93 +1289,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		oldHash = BufTableHashCode(&oldTag);
 		oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-		/*
-		 * Must lock the lower-numbered partition first to avoid
-		 * deadlocks.
-		 */
-		if (oldPartitionLock < newPartitionLock)
-		{
-			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-		}
-		else if (oldPartitionLock > newPartitionLock)
-		{
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-		}
-		else
-		{
-			/* only one partition, only one lock */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-		}
+		LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 	}
 	else
 	{
-		/* if it wasn't valid, we need only the new partition */
-		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 		/* remember we have no old-partition lock or tag */
 		oldPartitionLock = NULL;
 		/* keep the compiler quiet about uninitialized variables */
 		oldHash = 0;
 	}
 
-	/*
-	 * Try to make a hashtable entry for the buffer under its new tag.
-	 * This could fail because while we were writing someone else
-	 * allocated another buffer for the same block we want to read in.
-	 * Note that we have not yet removed the hashtable entry for the old
-	 * tag.
-	 */
-	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-	if (buf_id >= 0)
-	{
-		/*
-		 * Got a collision. Someone has already done what we were about to
-		 * do. We'll just handle this as if it were found in the buffer
-		 * pool in the first place. First, give up the buffer we were
-		 * planning to use.
-		 */
-		UnpinBuffer(buf, true);
-
-		/* Can give up that buffer's mapping partition lock now */
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-
-		/* remaining code should match code at top of routine */
-
-		buf = GetBufferDescriptor(buf_id);
-
-		valid = PinBuffer(buf, strategy);
-
-		/* Can release the mapping lock as soon as we've pinned it */
-		LWLockRelease(newPartitionLock);
-
-		*foundPtr = true;
-
-		if (!valid)
-		{
-			/*
-			 * We can only get here if (a) someone else is still reading
-			 * in the page, or (b) a previous read attempt failed.  We
-			 * have to wait for any active read attempt to finish, and
-			 * then set up our own read attempt if the page is still not
-			 * BM_VALID.  StartBufferIO does it all.
-			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer
-				 * must have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
-		}
-
-		return buf;
-	}
-
 	/*
 	 * Need to lock the buffer header too in order to change its tag.
 	 */
@@ -1382,52 +1306,132 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid.  We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * We have the buffer pinned and we are its single pinner at the moment,
+	 * so there are no other pinners.  We hold the buffer header lock and the
+	 * exclusive partition lock, if the tag is valid.  Given that, it is safe
+	 * to clear the tag since no other process can inspect it at the moment.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete the old tag from the hash table if it was valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag.  This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in.  In that case we will
+	 * have to return our buffer to the free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision.  Someone has already done what we were about to
+		 * do.  We'll just handle this as if it were found in the buffer pool
+		 * in the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it on
+		 * the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now it is safe to use the victim buffer for the new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use a direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer.  But StrategyGetBuffer,
+	 * Flush*Buffers and Drop*Buffers scan all buffers and lock them to
+	 * compare tags, and UnlockBufHdr does a raw write to the state.  So we
+	 * have to spin if we find the buffer locked.
+	 *
+	 * Note that we write the tag unlocked.  That is also safe since there is
+	 * always a check of BM_VALID when the tag is compared.
 	 */
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
 
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))
 	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
+		WaitBufHdrUnlocked(buf);
+		buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
 	}
 
 	LWLockRelease(newPartitionLock);
-- 
2.35.1
<<attachment: res.zip>>
