On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <mich...@paquier.xyz> wrote: > > Yeah, that's what I think too. The unintentional omission of a > > pre-shutdown delay in the 046 test has exposed some pre-existing > > fragility in pg_logical_slot_get_changes(). So I'm not in favor > > of changing 046 till we understand this better. > > Yes, better to understand what's going on before changing the test, > and perhaps change 046 so as it provide stable coverage for this case, > even if discovered accidentally. > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in > this context, because XLogDecodeNextRecord() is trying to read pages > as it has a logic too permissive? How did any of you reproduce the > failure? Just by running the test in a loop? It is one of these > patterns where a hardcoded sleep should do the trick and help with a > bisect.
Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2. cd src/test/recovery while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/ 047_checkpoint_physical_slot.pl" make check; do :; done > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if > you expect the checkpoint to complete before sending the immediate > shutdown request, I would suggest to use a wait_for_log() based on > "checkpoint complete" or an equivalent loop. What you are doing in > the test is unstable as written. Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang. Links. 1. https://www.postgresql.org/message-id/CAPpHfdurV-j_e0pb%3DUFENAy3tyzxfF%2ByHveNDNQk2gM82WBU5A%40mail.gmail.com ------ Regards, Alexander Korotkov Supabase