Hi Andres,

12.04.2024 07:41, Andres Freund wrote:

> FWIW, I just reproduced the scenario with signals. I added tracking of the
> total time actually slept and lost to SpinDelayStatus, and added a function to
> trigger a wait on a spinlock.
>
> To wait less, I set max_standby_streaming_delay=0.1, but that's just for
> easier testing in isolation. In reality that could have been reached before
> the spinlock is even acquired.
>
> On a standby, while a recovery conflict is happening:
> PANIC:  XX000: stuck spinlock detected at crashme, path/to/file:line, after 4.38s, lost 127.96s
>
> So right now it's really not hard to trigger the stuck-spinlock logic
> completely spuriously.  This doesn't just happen with hot standby, there are
> plenty other sources of lots of signals being sent.

I managed to trigger that logic when trying to construct a reproducer
for bug #18426.

With the following delays added:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1776,6 +1776,7 @@ retry:
      */
     if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
     {
+pg_usleep(300000L);
         UnlockBufHdr(buf, buf_state);
         LWLockRelease(oldPartitionLock);
         /* safety check: should definitely not be our *own* pin */
@@ -5549,6 +5550,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,

     Assert(buf_state & BM_IO_IN_PROGRESS);

+pg_usleep(300);
     buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
     if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
         buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);

and /tmp/temp.config:
bgwriter_delay = 10

TEMP_CONFIG=/tmp/temp.config make -s check -C src/test/recovery PROVE_TESTS="t/032*"
fails for me on iterations 22, 23, 37:
2024-04-12 05:00:17.981 UTC [762336] PANIC:  stuck spinlock detected at WaitBufHdrUnlocked, bufmgr.c:5726

I haven't investigated this case yet.
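
For readers following the "slept" versus "lost" figures in the quoted
message, here is a minimal sketch of that kind of accounting. It is not
Andres's patch: the struct, its fields, the function names and the limits
passed to the checks are all hypothetical, and it only assumes that a sleep
interrupted by a signal returns early, so that counting sleep attempts
overstates the time really waited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters; not the actual fields added to SpinDelayStatus. */
typedef struct DelayAccounting
{
    int     delays;     /* sleep attempts so far */
    int64_t slept_us;   /* time actually spent sleeping */
    int64_t lost_us;    /* requested sleep time cut off by signals */
} DelayAccounting;

/* Record one sleep that asked for requested_us but returned after actual_us. */
void
account_delay(DelayAccounting *acct, int64_t requested_us, int64_t actual_us)
{
    assert(requested_us >= 0 && actual_us >= 0);

    acct->delays++;
    acct->slept_us += actual_us;
    if (actual_us < requested_us)
        acct->lost_us += requested_us - actual_us;
}

/* Count-based check: trips after a fixed number of sleep attempts. */
bool
stuck_by_count(const DelayAccounting *acct, int max_delays)
{
    return acct->delays >= max_delays;
}

/* Time-based check: trips only once enough time was really slept. */
bool
stuck_by_time(const DelayAccounting *acct, int64_t min_slept_us)
{
    return acct->slept_us >= min_slept_us;
}
```

Feeding this 1000 sleeps of 100 ms that each return after 3 ms gives 3 s
slept and 97 s lost: a count-based check with a limit of 1000 fires, while a
time-based check with an (arbitrary) 60 s threshold does not.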

Best regards,
Alexander

