Hi,

I looked at this again, and I think the reason is mostly obvious, both
why it's thrashing and why it happens with checksums=on.

The reason why it happens is that PinBuffer does this:

    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);

        buf_state = old_buf_state;

        ... modify state ...

        /*
         * Succeeds only if buf->state still equals old_buf_state; on
         * failure old_buf_state is refreshed and we go around again.
         */
        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           buf_state))
        {
            ...
            break;
        }
    }

So, we read the buffer state (which is where pins are tracked), possibly
waiting for it to get unlocked. Then we compute the new state and try to
store it with a compare-and-swap, which succeeds only if the state did
not change since we read it. If it did change, we retry.
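For illustration, here is a minimal standalone sketch of the same
read/modify/compare-and-swap pattern using C11 atomics. It is not
PostgreSQL code: the SketchBufferDesc type, the SKETCH_* bit layout and
the sketch_* functions are hypothetical, and waiting for the lock bit is
a plain busy loop rather than PostgreSQL's spin-delay logic. Each call
returns the number of failed CAS attempts, i.e. the retries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout (not PostgreSQL's real BUF_* constants):
 * the low 18 bits hold the pin count, bit 22 acts as the header lock. */
#define SKETCH_REFCOUNT_ONE  1u
#define SKETCH_REFCOUNT_MASK 0x3FFFFu
#define SKETCH_LOCKED        (1u << 22)

typedef struct
{
    _Atomic uint32_t state;     /* pins and lock bit share one word */
} SketchBufferDesc;

static void
sketch_init(SketchBufferDesc *buf)
{
    atomic_init(&buf->state, 0);
}

/* Busy-wait until the lock bit is clear; return the value observed. */
static uint32_t
sketch_wait_unlocked(SketchBufferDesc *buf)
{
    uint32_t s = atomic_load(&buf->state);

    while (s & SKETCH_LOCKED)
        s = atomic_load(&buf->state);
    return s;
}

/* Pin (pin != 0) or unpin (pin == 0) with a CAS loop.
 * Returns how many CAS attempts failed before one succeeded. */
static unsigned
sketch_update_pin(SketchBufferDesc *buf, int pin)
{
    unsigned retries = 0;
    uint32_t old_state = atomic_load(&buf->state);

    for (;;)
    {
        uint32_t new_state;

        if (old_state & SKETCH_LOCKED)
            old_state = sketch_wait_unlocked(buf);

        if (pin)
            new_state = old_state + SKETCH_REFCOUNT_ONE;
        else
        {
            assert((old_state & SKETCH_REFCOUNT_MASK) > 0);
            new_state = old_state - SKETCH_REFCOUNT_ONE;
        }

        /* Succeeds only if nobody changed the word since we read it;
         * on failure old_state receives the current value. */
        if (atomic_compare_exchange_strong(&buf->state, &old_state, new_state))
            return retries;
        retries++;
    }
}
```

With a single thread every CAS succeeds on the first attempt; retries
only appear when other threads write to state between the load and the
CAS.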

Of course, as the number of sessions grows, the probability of something
updating the state in between increases. Another session might have
pinned the buffer, for example. This causes retries.

I added a couple of counters to track how many loop iterations are
needed, and with 96 clients this needs about 800k retries per 100k
calls, so about 8 retries per call. With 32 clients, this needs only
about 25k retries, so 0.25 retries per call. That's a huge difference:
3x the clients, but 32x the retries per call.
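A rough way to reproduce the shape of this measurement outside
PostgreSQL is a small pthreads harness, sketched below under the
assumption that a bare 32-bit pin counter (no lock bit, no real buffers)
is a fair stand-in for the contended state word. Each thread counts its
own failed CAS attempts locally, so the counters themselves do not add
contention. All names (sketch_state, SketchWorker, sketch_run) are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_THREADS 128

/* Hypothetical stand-in for the buffer state word: just a pin count. */
static _Atomic uint32_t sketch_state;

typedef struct
{
    unsigned iterations;        /* pin+unpin pairs to perform */
    uint64_t calls;             /* CAS-loop invocations (output) */
    uint64_t retries;           /* failed CAS attempts (output) */
} SketchWorker;

/* +1 (pin) or -1 (unpin) via a CAS loop; returns failed attempts. */
static uint64_t
sketch_cas_update(int pin)
{
    uint32_t old_state = atomic_load(&sketch_state);
    uint64_t retries = 0;

    for (;;)
    {
        uint32_t new_state = pin ? old_state + 1u : old_state - 1u;

        if (atomic_compare_exchange_strong(&sketch_state, &old_state, new_state))
            return retries;
        retries++;              /* old_state was refreshed by the failed CAS */
    }
}

static void *
sketch_worker(void *arg)
{
    SketchWorker *w = arg;
    uint64_t calls = 0, retries = 0;

    for (unsigned i = 0; i < w->iterations; i++)
    {
        retries += sketch_cas_update(1);    /* "pin" */
        retries += sketch_cas_update(0);    /* "unpin" */
        calls += 2;
    }
    w->calls = calls;           /* publish once to avoid false sharing */
    w->retries = retries;
    return NULL;
}

/* Run nthreads workers and return the average retries per call. */
static double
sketch_run(int nthreads, unsigned iterations)
{
    pthread_t tids[SKETCH_MAX_THREADS];
    SketchWorker workers[SKETCH_MAX_THREADS];
    uint64_t calls = 0, retries = 0;

    assert(nthreads > 0 && nthreads <= SKETCH_MAX_THREADS);
    atomic_store(&sketch_state, 0);

    for (int i = 0; i < nthreads; i++)
    {
        workers[i] = (SketchWorker) { .iterations = iterations };
        if (pthread_create(&tids[i], NULL, sketch_worker, &workers[i]) != 0)
            abort();
    }
    for (int i = 0; i < nthreads; i++)
    {
        pthread_join(tids[i], NULL);
        calls += workers[i].calls;
        retries += workers[i].retries;
    }

    assert(atomic_load(&sketch_state) == 0);   /* every pin was undone */
    return calls ? (double) retries / (double) calls : 0.0;
}
```

Comparing, say, sketch_run(4, 100000) with sketch_run(32, 100000) on a
many-core machine would typically show retries per call growing with the
thread count, but the absolute values depend on the hardware and are not
comparable to the PostgreSQL numbers above.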

I believe enabling data checksums simply makes it more severe, because
with checksums BufferGetLSNAtomic() has to obtain the buffer header
lock, which uses the same "state" field, with a similar retry loop. It
can probably happen even without checksums, but as the header lock is
exclusive, it also "serializes" the access, making the conflicts more
likely.

BufferGetLSNAtomic does this:

    bufHdr = GetBufferDescriptor(buffer - 1);
    buf_state = LockBufHdr(bufHdr);     /* sets BM_LOCKED in bufHdr->state */
    lsn = PageGetLSN(page);
    UnlockBufHdr(bufHdr, buf_state);    /* clears BM_LOCKED again */
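To see why this competes with pinning, here is a simplified standalone
model, assuming (as a sketch, not PostgreSQL's actual code) that the
header lock is one bit in the same 32-bit word pins are CASed on.
Locking is an atomic fetch-or that spins while the bit was already set;
while it is held, a pinner must wait, and both the lock and the unlock
are writes to the word that make any in-flight pin CAS fail. The names
SketchBufferDesc, sketch_lock_hdr, sketch_unlock_hdr and the LSN helpers
are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_LOCKED (1u << 22)    /* hypothetical lock bit in the state word */

typedef struct
{
    _Atomic uint32_t state;     /* in this model, also holds the pin count */
    uint64_t page_lsn;          /* stands in for the LSN in the page header */
} SketchBufferDesc;

static void
sketch_init(SketchBufferDesc *buf)
{
    atomic_init(&buf->state, 0);
    buf->page_lsn = 0;
}

/* Spin until we are the one who set the lock bit; return the locked state. */
static uint32_t
sketch_lock_hdr(SketchBufferDesc *buf)
{
    for (;;)
    {
        uint32_t old_state = atomic_fetch_or(&buf->state, SKETCH_LOCKED);

        if (!(old_state & SKETCH_LOCKED))
            return old_state | SKETCH_LOCKED;
        /* someone else holds it: spin (no back-off in this sketch) */
    }
}

/* Release by writing back the state with the bit cleared. This is safe
 * only because nobody else modifies the word while the bit is set. */
static void
sketch_unlock_hdr(SketchBufferDesc *buf, uint32_t state)
{
    atomic_store(&buf->state, state & ~SKETCH_LOCKED);
}

/* In this sketch, writers set page_lsn under the same lock, so a reader
 * holding it never sees a torn 64-bit value. */
static void
sketch_set_lsn_locked(SketchBufferDesc *buf, uint64_t lsn)
{
    uint32_t state = sketch_lock_hdr(buf);

    buf->page_lsn = lsn;
    sketch_unlock_hdr(buf, state);
}

static uint64_t
sketch_get_lsn_locked(SketchBufferDesc *buf)
{
    uint32_t state = sketch_lock_hdr(buf);
    uint64_t lsn = buf->page_lsn;

    sketch_unlock_hdr(buf, state);
    return lsn;
}
```

Every LSN read therefore costs two writes to the contended word, on top
of the pin and unpin CAS loops.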

AFAICS the lock is needed simply to read a consistent value from the
page header, but maybe we could have an atomic variable with a copy of
the LSN in the buffer descriptor?
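One way that could look, as a hypothetical sketch rather than a patch:
the descriptor carries an atomic 64-bit copy of the page LSN, published
by whoever sets the page LSN, and readers use a single atomic load
instead of the header lock. This assumes every place that updates the
page LSN also updates the copy, and that 64-bit atomics are cheap on
the platform (C11 may implement them with a lock on some targets). The
lsn_copy field and the sketch_* functions are invented names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    _Atomic uint32_t state;     /* pins + lock bit, untouched by LSN readers */
    _Atomic uint64_t lsn_copy;  /* hypothetical mirror of the page-header LSN */
} SketchBufferDesc;

static void
sketch_init(SketchBufferDesc *buf)
{
    atomic_init(&buf->state, 0);
    atomic_init(&buf->lsn_copy, 0);
}

/* Writer side: whoever sets the page LSN would also publish the copy. */
static void
sketch_publish_lsn(SketchBufferDesc *buf, uint64_t lsn)
{
    atomic_store_explicit(&buf->lsn_copy, lsn, memory_order_release);
}

/* Reader side: one atomic load, no header lock, no write to state. */
static uint64_t
sketch_get_lsn_atomic(SketchBufferDesc *buf)
{
    return atomic_load_explicit(&buf->lsn_copy, memory_order_acquire);
}
```

The point of the design is that the reader never writes to the state
word, so it cannot make a concurrent pinner's CAS fail.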


regards

-- 
Tomas Vondra

      |
       --91.21%--btgettuple
                 |
                 |--58.16%--_bt_first
                 |          |
                 |          |--41.47%--_bt_search
                 |          |          |
                 |          |           --41.07%--_bt_relandgetbuf
                 |          |                     |
                 |          |                     |--39.39%--ReadBufferExtended
                 |          |                     |          StartReadBuffer
                 |          |                     |          |
                 |          |                     |           --38.46%--PinBuffer
                 |          |                     |                     |
                 |          |                     |                     |--29.14%--WaitBufHdrUnlocked (inlined)
                 |          |                     |                     |
                 |          |                     |                      --8.83%--pg_atomic_compare_exchange_u32 (inlined)
                 |          |                     |                                pg_atomic_compare_exchange_u32_impl (inlined)
                 |          |                     |
                 |          |                      --1.63%--_bt_lockbuf (inlined)
                 |          |                                LWLockAcquire
                 |          |                                |
                 |          |                                 --1.62%--LWLockAttemptLock (inlined)
                 |          |                                           |
                 |          |                                            --1.37%--pg_atomic_compare_exchange_u32 (inlined)
                 |          |                                                      pg_atomic_compare_exchange_u32_impl (inlined)
                 |          |
                 |           --16.51%--_bt_readfirstpage
                 |                     |
                 |                     |--15.45%--_bt_readpage
                 |                     |          |
                 |                     |          |--14.29%--BufferGetLSNAtomic
                 |                     |          |          |
                 |                     |          |           --13.86%--LockBufHdr
                 |                     |          |
                 |                     |           --0.67%--BufferGetBlockNumber
                 |                     |
                 |                      --1.06%--LWLockRelease
                 |                                LWLockReleaseInternal
                 |                                pg_atomic_sub_fetch_u32 (inlined)
                 |                                pg_atomic_sub_fetch_u32_impl (inlined)
                 |                                pg_atomic_fetch_sub_u32_impl (inlined)
                 |
                  --33.05%--_bt_next
                            |
                             --33.03%--_bt_steppage
                                       |
                                       |--32.41%--UnpinBufferNoOwner
                                       |          |
                                       |           --7.30%--pg_atomic_compare_exchange_u32 (inlined)
                                       |                     pg_atomic_compare_exchange_u32_impl (inlined)
                                       |
                                        --0.61%--ReleaseBuffer
                                                  UnpinBuffer (inlined)
                                                  BufferDescriptorGetBuffer (inlined)
