Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

Michael Paquier Wed, 14 Feb 2024 21:10:11 -0800

On Thu, Jan 18, 2024 at 02:20:28PM +0000, Bertrand Drouvot wrote:
> On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
>> I'm not sure if it
>> can happen at all, but I think we can rely on previous conflict reason
>> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.
> 
> I'm not sure what you mean here. I think we should still keep the "initial" 
> LSN
> so that the next check done with it still makes sense. The previous conflict
> reason as you're proposing also makes sense to me but for another reason: 
> PANIC
> in case the issue still happen (for cases we did not think about, means not
> covered by what the added previous LSNs are covering).


Using a PANIC seems like an overreaction to me for this path.  Sure, 
we should not record twice a conflict reason, but wouldn't an
assertion be more adapted enough to check that a termination is not
logged twice for this code path run in the checkpointer?

+            if (!terminated)
+            {
+                initial_restart_lsn = s->data.restart_lsn;
+                initial_effective_xmin = s->effective_xmin;
+                initial_catalog_effective_xmin = s->effective_catalog_xmin;
+            }

It seems important to document why we are saving this data here; while
we hold the LWLock for repslots, before we perform any termination,
and we  want to protect conflict reports with incorrect data if the
slot data got changed while the lwlock is temporarily released while
there's a termination.  No need to mention the pesky standby snapshot
records, I guess, as there could be different reasons why these xmins
advance.

>> 3. Is there a way to reproduce this racy issue, may be by adding some
>> sleeps or something? If yes, can you share it here, just for the
>> records and verify the whatever fix provided actually works?
> 
> Alexander was able to reproduce it on a slow machine and the issue was not 
> there
> anymore with v1 in place. I think it's tricky to reproduce as it would need 
> the
> slot to advance between the 2 checks.

I'd really want a test for that in the long term.  And an injection
point stuck between the point where ReplicationSlotControlLock is
released then re-acquired when there is an active PID should be
enough, isn't it?  For example just after the kill()?  It seems to me
that we should stuck the checkpointer, perform a forced upgrade of the
xmins, release the checkpointer and see the effects of the change in
the second loop.  Now, modules/injection_points/ does not allow that,
yet, but I've posted a patch among these lines that can stop a process
and release it using a condition variable (should be 0006 on [1]).  I
was planning to start a new thread with a patch posted for the next CF
to add this kind of facility with a regression test for an old bug,
the patch needs a few hours of love, first.  I should be able to post
that next week.

[1]: 
https://www.postgresql.org/message-id/caexhw5uwp9rmct9ba91bufknqeurosawtmens4ah2puyzv2...@mail.gmail.com
--
Michael

signature.asc
Description: PGP signature

Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

Reply via email to