On Sat, Jun 21, 2025 at 1:40 AM Alexander Korotkov <aekorot...@gmail.com> wrote:
> On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> > I wrote:
> > > But in the buildfarm failures I don't see any 'checkpoint complete'
> > > before the shutdown.
> >
> > Ooops, I lied: we have at least one case where the checkpoint does
> > finish but then it hangs up anyway:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
> >
> > 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> > 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> > 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG: disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG: disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG: statement: SELECT 1
> > 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received immediate shutdown request
> > 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database system is shut down
> >
> > I'm still confused about whether the 046 script intends to
> > sometimes test cases where the checkpoint doesn't have time to
> > complete.  But it seems that whatever is going on here
> > is a bit subtle and platform-dependent.
>
> I think this indicates unfinished intention to wait for checkpoint
> completion.  But I think both cases (checkpoint finished and
> unfinished) should work correctly.  So, I believe there is a backend
> problem.  I'm trying to reproduce this locally.  Sorry for the
> confusion.
I managed to reproduce the bug locally.  I've got the following backtrace.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000019e6fd3c8 libsystem_kernel.dylib`__semwait_signal + 8
    frame #1: 0x000000019e5de568 libsystem_c.dylib`nanosleep + 220
    frame #2: 0x0000000103009628 postgres`pg_usleep(microsec=1000) at pgsleep.c:50:10
    frame #3: 0x00000001027bd248 postgres`read_local_xlog_page_guts(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4
    frame #4: 0x00000001027bd130 postgres`read_local_xlog_page(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005") at xlogutils.c:848:9
    frame #5: 0x00000001027ae184 postgres`ReadPageInternal(state=0x000000011383ed58, pageptr=687882240, reqLen=277) at xlogreader.c:1061:12
    frame #6: 0x00000001027acd14 postgres`XLogDecodeNextRecord(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:727:14
    frame #7: 0x00000001027ac6bc postgres`XLogReadAhead(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:974:11
    frame #8: 0x00000001027ac5c0 postgres`XLogReadRecord(state=0x000000011383ed58, errormsg=0x000000016d798dd0) at xlogreader.c:405:3
    frame #9: 0x0000000102c1d1d0 postgres`pg_logical_slot_get_changes_guts(fcinfo=0x0000000113830700, confirm=true, binary=false) at logicalfuncs.c:257:13
    frame #10: 0x0000000102c1ca00 postgres`pg_logical_slot_get_changes(fcinfo=0x0000000113830700) at logicalfuncs.c:333:9
    frame #11: 0x00000001029e1adc postgres`ExecMakeTableFunctionResult(setexpr=0x0000000113825440, econtext=0x00000001138252e0, argContext=0x0000000113830600, expectedDesc=0x000000011382ecb0, randomAccess=false) at execSRF.c:234:13

While looking at frame 5, I found the following code being executed.

	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
	targetRecOff = RecPtr % XLOG_BLCKSZ;

	/*
	 * Read the page containing the record into state->readBuf. Request
	 * enough byte to cover the whole record header, or at least the part of
	 * it that fits on the same page.
	 */
	readOff = ReadPageInternal(state, targetPagePtr,
							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));

And I see the following variable values.

(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138

I can hardly understand how this is possible, given that it was compiled
with "-O0".  I'm planning to continue investigating this tomorrow.

------
Regards,
Alexander Korotkov
Supabase
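
For reference, a minimal standalone sketch of the page-boundary arithmetic
from the quoted xlogreader.c snippet, assuming the default XLOG_BLCKSZ of
8192 bytes and the targetRecPtr printed in the backtrace.  With
RecPtr = 0x29002138 the formula yields targetPagePtr = 0x29002000, while the
backtrace shows pageptr = 0x29004000, which is the start of the next 8 kB
page (0x29002000 + 0x2000).

#include <stdio.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192		/* assumed default WAL block size */

int
main(void)
{
	/* targetRecPtr (687874360) from frames #3 and #4 of the backtrace */
	uint64_t	RecPtr = 0x29002138;

	/* same arithmetic as the quoted xlogreader.c snippet */
	uint64_t	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
	uint64_t	targetRecOff = RecPtr % XLOG_BLCKSZ;

	/* prints 0x29002000 and 0x138 */
	printf("targetPagePtr = 0x%llx\n", (unsigned long long) targetPagePtr);
	printf("targetRecOff  = 0x%llx\n", (unsigned long long) targetRecOff);

	/*
	 * The backtrace instead shows pageptr = 0x29004000 (687882240), i.e.
	 * the start of the following 8 kB page.
	 */
	return 0;
}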