On Thu, Dec 04, 2025 at 12:00:23PM +0500, Kirill Reshke wrote:
> Your tests are using a busy loop with usleep which nowadays is
> considered as bad practice. There 3 of such places, and I think the
> first two of them can be replaced with injection point wait.
(This fell off my radar, apologies about that.)

The problem with this test is not related to its use of sleeps, which
is perfectly fine to check the start or end timing of a checkpoint
depending on the contents of the logs.  It has two issues.

The first issue is that the test is unnecessarily long, taking more
than 30s to finish because it relies on checkpoint_timeout to kick a
checkpoint.  It could use a background psql object to kick a
checkpoint and accelerate the whole thing; as written, the test sits
idle for a long time, doing nothing.

Your intuition about injection points is right, though it points at a
second problem: the test is not reliable, because we could finish the
checkpoint *before* the WAL segment is switched, while we expect the
WAL segment switch to happen while the checkpoint is processing.  If
you want to make that deterministic, having a wait in the middle of
the checkpoint would make the test actually test what it should.  As
it stands, the test would randomly die on its "redo record wal file is
the same" message.  That's OK to prove the point of the initial patch,
but it's not acceptable for a test that could be added to the tree.
An update of src/test/recovery/meson.build to add the new test is also
required.

Anyway, I think that the patch is overdoing it by issuing a PANIC in
this case, going backwards with respect to the other thread from [0]:
a FATAL would be perfectly OK, like in the backup_label path, because
the manipulations that make a WAL segment go missing are indeed
possible, as the test posted is proving.  And there have been many
arguments in the past about performing recovery without a
backup_label, as well.  That would also make the test something we
could use, because there would be no backtrace on buildfarm hosts nor
disk bloat.

If we do not re-check in InitWalRecovery() that the redo record is
around, the startup process happily goes down to PerformWalRecovery()
in an everything-is-fine mode, initializing a bunch of states, only to
crash on a pointer dereference while attempting to read the record at
the redo LSN, which does not exist.  This is not right.  So, oops.  I
would see an argument good enough for a backpatch when it comes to
crash recovery, because we actually don't know what has happened in
this case.  Or not, based on the lack of complaints on the matter over
the years.

So I'd argue for updating the patch along these lines, instead:
- Switch the PANIC to a FATAL, to inform about the pilot error.  This
  addresses the pointer dereference with WAL replay going crazy on the
  redo LSN (a rough sketch of the check is further down).
- Add a test, with an injection point after checkpoint startup, based
  on a manual checkpoint done in a background psql, without relying on
  checkpoint_timeout, so as to make the test faster.
- Based on the lack of complaints, we could be careful first and do
  that extra check only on HEAD, for v19.

> 4) There are comments from Robert, which are not covered [1].
>
> [0] https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
> [1] https://www.postgresql.org/message-id/CA%2BTgmob1x7HMcWAb%3D6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg%40mail.gmail.com

I get the argument about the spaghetti code in general; however, the
code in question here deals with the WAL replay initialization, for a
backup_label vs non-backup_label path.  Perhaps I'm more used to this
area than others, but it's not that much pasta to me.  I don't see a
point in moving the memcpy() and the wasShutdown parts as proposed in
the patch, by the way: the PANIC would block things in its else block.
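To be concrete about the extra check, here is a minimal sketch of what
I have in mind for the no-backup_label branch of InitWalRecovery(),
mirroring the check that the backup_label path already applies to the
redo record.  This is only an illustration written from memory, not
the actual patch, and the error wording is made up:

    /*
     * Sketch only: inside the no-backup_label branch of
     * InitWalRecovery(), after the checkpoint record referenced by
     * pg_control has been read into "checkPoint".  If the redo record
     * it points at cannot be read, stop with a FATAL instead of
     * letting PerformWalRecovery() crash later on a pointer
     * dereference.
     */
    if (checkPoint.redo < CheckPointLoc)
    {
        /* Position the reader at the redo LSN, check the record exists. */
        XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);
        if (!ReadRecord(xlogprefetcher, LOG, false,
                        checkPoint.ThisTimeLineID))
            ereport(FATAL,
                    (errmsg("could not find redo location %X/%X referenced by checkpoint record at %X/%X",
                            LSN_FORMAT_ARGS(checkPoint.redo),
                            LSN_FORMAT_ARGS(CheckPointLoc))));
    }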
Let's limit ourselves to the extra ReadRecord() check for the redo
record when we find a checkpoint record.  I'd actually wonder if we
should not lower the existing PANIC to a FATAL when
ReadCheckpointRecord() fails, as well.  The result would be the same
for operators, without the need to deal with backtrace cleanups after
a crash, and we would still have the error message that tells us
what's going on.
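For that part, the change would be little more than lowering the
severity of the existing error path taken when ReadCheckpointRecord()
returns nothing.  Again a sketch written from memory, the message
wording may not match the tree exactly:

    /*
     * Sketch only: existing error path in InitWalRecovery() when the
     * checkpoint record referenced by pg_control cannot be read.
     * FATAL keeps the same message for operators, without a core file
     * or backtrace to clean up afterwards.
     */
    ereport(FATAL,          /* was PANIC */
            (errmsg("could not locate a valid checkpoint record at %X/%X",
                    LSN_FORMAT_ARGS(CheckPointLoc))));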
Any thoughts or opinions from others?
--
Michael
