Hi, I looked at this again, and I think the reason is mostly obvious. Both why it's thrashing, and why it happens with checksums=on ...
The reason why it happens is that PinBuffer does this:

    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);

        buf_state = old_buf_state;

        ... modify state ...

        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           buf_state))
        {
            ...
            break;
        }
    }

So, we read the buffer state (which is where pins are tracked), possibly
waiting for it to get unlocked. Then we modify the state and write it
back, but only if it didn't change in the meantime. If it did change, we
retry.

Of course, as the number of sessions grows, the probability of something
updating the state in between increases. Another session might have
pinned the buffer, for example. This causes retries.

I added a couple of counters to track how many loops are needed. With 96
clients this needs about 800k retries per 100k calls, so about 8 retries
per call. With 32 clients, it needs only about 25k retries, so 0.25
retries per call. That's a huge difference.

I believe enabling data checksums simply makes it more severe, because
BufferGetLSNAtomic() has to obtain the header lock, which uses the same
"state" field, with exactly the same retry logic. It can probably happen
even without it, but as the lock is exclusive, it also "serializes" the
access, making the conflicts more likely.

BufferGetLSNAtomic does this:

    bufHdr = GetBufferDescriptor(buffer - 1);
    buf_state = LockBufHdr(bufHdr);
    lsn = PageGetLSN(page);
    UnlockBufHdr(bufHdr, buf_state);

AFAICS the lock is needed simply to read a consistent value from the
page header, but maybe we could have an atomic variable with a copy of
the LSN in the buffer descriptor?

regards

--
Tomas Vondra
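To make the retry accounting described above concrete, here is a minimal standalone sketch of the same read / modify / compare-and-swap pattern, written with C11 <stdatomic.h> rather than PostgreSQL's pg_atomic_* wrappers. It is not the PostgreSQL code: the demo_* names, the DEMO_LOCKED bit position and the "low bits count pins" layout are all assumptions made for illustration. The function returns how many compare-and-swap attempts failed, which is the kind of counter mentioned in the message.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: bit 31 is a "header locked" flag, low bits count pins. */
#define DEMO_LOCKED   (UINT32_C(1) << 31)
#define DEMO_PIN_ONE  UINT32_C(1)

/* Spin until the lock bit is clear, then return the state observed. */
static uint32_t
demo_wait_unlocked(_Atomic uint32_t *state)
{
    uint32_t s = atomic_load(state);

    while (s & DEMO_LOCKED)
        s = atomic_load(state);
    return s;
}

/*
 * Add one pin, starting from the snapshot old_state (which may be stale).
 * Returns the number of failed compare-and-swap attempts, i.e. retries.
 */
static unsigned
demo_pin(_Atomic uint32_t *state, uint32_t old_state)
{
    unsigned retries = 0;

    for (;;)
    {
        if (old_state & DEMO_LOCKED)
            old_state = demo_wait_unlocked(state);

        uint32_t new_state = old_state + DEMO_PIN_ONE;

        /* On failure, old_state is overwritten with the current value. */
        if (atomic_compare_exchange_strong(state, &old_state, new_state))
            return retries;

        retries++;
    }
}
```

Calling demo_pin() with a snapshot that no longer matches the shared word (for example because another session pinned the buffer after the snapshot was taken) costs one failed compare-and-swap before the update goes through. Under heavy concurrency that stale-snapshot case becomes the common one, which is where the per-call retry count comes from.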
|
 --91.21%--btgettuple
           |
           |--58.16%--_bt_first
           |          |
           |          |--41.47%--_bt_search
           |          |          |
           |          |           --41.07%--_bt_relandgetbuf
           |          |                     |
           |          |                     |--39.39%--ReadBufferExtended
           |          |                     |          StartReadBuffer
           |          |                     |          |
           |          |                     |           --38.46%--PinBuffer
           |          |                     |                     |
           |          |                     |                     |--29.14%--WaitBufHdrUnlocked (inlined)
           |          |                     |                     |
           |          |                     |                      --8.83%--pg_atomic_compare_exchange_u32 (inlined)
           |          |                     |                                pg_atomic_compare_exchange_u32_impl (inlined)
           |          |                     |
           |          |                      --1.63%--_bt_lockbuf (inlined)
           |          |                                LWLockAcquire
           |          |                                |
           |          |                                 --1.62%--LWLockAttemptLock (inlined)
           |          |                                           |
           |          |                                            --1.37%--pg_atomic_compare_exchange_u32 (inlined)
           |          |                                                      pg_atomic_compare_exchange_u32_impl (inlined)
           |          |
           |           --16.51%--_bt_readfirstpage
           |                     |
           |                     |--15.45%--_bt_readpage
           |                     |          |
           |                     |          |--14.29%--BufferGetLSNAtomic
           |                     |          |          |
           |                     |          |           --13.86%--LockBufHdr
           |                     |          |
           |                     |           --0.67%--BufferGetBlockNumber
           |                     |
           |                      --1.06%--LWLockRelease
           |                                LWLockReleaseInternal
           |                                pg_atomic_sub_fetch_u32 (inlined)
           |                                pg_atomic_sub_fetch_u32_impl (inlined)
           |                                pg_atomic_fetch_sub_u32_impl (inlined)
           |
            --33.05%--_bt_next
                      |
                       --33.03%--_bt_steppage
                                 |
                                 |--32.41%--UnpinBufferNoOwner
                                 |          |
                                 |           --7.30%--pg_atomic_compare_exchange_u32 (inlined)
                                 |                     pg_atomic_compare_exchange_u32_impl (inlined)
                                 |
                                  --0.61%--ReleaseBuffer
                                            UnpinBuffer (inlined)
                                            BufferDescriptorGetBuffer (inlined)
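The BufferGetLSNAtomic -> LockBufHdr samples in the profile are the cost that the "atomic copy of the LSN" idea from the message would try to avoid. A rough sketch of what the reader and writer sides might look like follows, again using C11 atomics. DemoBufferDesc, lsn_copy and both functions are hypothetical names, not existing PostgreSQL structures or APIs. The sketch also skips the hard part, which is keeping the copy in sync with the page header everywhere the page LSN changes, and it assumes the platform has native 64-bit atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical descriptor: only the field needed for this illustration. */
typedef struct DemoBufferDesc
{
    /* Mirror of the page LSN; assumed to be updated by every LSN writer. */
    _Atomic uint64_t lsn_copy;
} DemoBufferDesc;

/* Writer side: publish the new LSN after the page header was updated. */
static void
demo_set_lsn(DemoBufferDesc *desc, uint64_t lsn)
{
    atomic_store_explicit(&desc->lsn_copy, lsn, memory_order_release);
}

/* Reader side: a single atomic load, with no header lock and no CAS loop. */
static uint64_t
demo_get_lsn_atomic(DemoBufferDesc *desc)
{
    return atomic_load_explicit(&desc->lsn_copy, memory_order_acquire);
}
```

The point of the design is that readers would no longer touch the shared "state" word at all, so they could neither fail a compare-and-swap nor cause one to fail in a concurrent PinBuffer call. Whether that holds up in practice would need to be measured.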