On Thu, Jan 18, 2024 at 02:20:28PM +0000, Bertrand Drouvot wrote: > On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote: >> I'm not sure if it >> can happen at all, but I think we can rely on previous conflict reason >> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn. > > I'm not sure what you mean here. I think we should still keep the "initial" > LSN > so that the next check done with it still makes sense. The previous conflict > reason as you're proposing also makes sense to me but for another reason: > PANIC > in case the issue still happen (for cases we did not think about, means not > covered by what the added previous LSNs are covering).
Using a PANIC seems like an overreaction to me for this path. Sure, we should not record twice a conflict reason, but wouldn't an assertion be more adapted enough to check that a termination is not logged twice for this code path run in the checkpointer? + if (!terminated) + { + initial_restart_lsn = s->data.restart_lsn; + initial_effective_xmin = s->effective_xmin; + initial_catalog_effective_xmin = s->effective_catalog_xmin; + } It seems important to document why we are saving this data here; while we hold the LWLock for repslots, before we perform any termination, and we want to protect conflict reports with incorrect data if the slot data got changed while the lwlock is temporarily released while there's a termination. No need to mention the pesky standby snapshot records, I guess, as there could be different reasons why these xmins advance. >> 3. Is there a way to reproduce this racy issue, may be by adding some >> sleeps or something? If yes, can you share it here, just for the >> records and verify the whatever fix provided actually works? > > Alexander was able to reproduce it on a slow machine and the issue was not > there > anymore with v1 in place. I think it's tricky to reproduce as it would need > the > slot to advance between the 2 checks. I'd really want a test for that in the long term. And an injection point stuck between the point where ReplicationSlotControlLock is released then re-acquired when there is an active PID should be enough, isn't it? For example just after the kill()? It seems to me that we should stuck the checkpointer, perform a forced upgrade of the xmins, release the checkpointer and see the effects of the change in the second loop. Now, modules/injection_points/ does not allow that, yet, but I've posted a patch among these lines that can stop a process and release it using a condition variable (should be 0006 on [1]). I was planning to start a new thread with a patch posted for the next CF to add this kind of facility with a regression test for an old bug, the patch needs a few hours of love, first. I should be able to post that next week. [1]: https://www.postgresql.org/message-id/caexhw5uwp9rmct9ba91bufknqeurosawtmens4ah2puyzv2...@mail.gmail.com -- Michael
signature.asc
Description: PGP signature