On Mon, Dec 29, 2025 at 7:55 PM Chao Li <[email protected]> wrote:
>
>
>
> > On Dec 30, 2025, at 06:14, Masahiko Sawada <[email protected]> wrote:
> >
> > On Thu, Dec 18, 2025 at 7:19 PM Zhijie Hou (Fujitsu)
> > <[email protected]> wrote:
> >>
> >> On Friday, December 19, 2025 3:42 AM Masahiko Sawada 
> >> <[email protected]> wrote:
> >>>
> >>> On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) 
> >>> <[email protected]>
> >>> wrote:
> >>>>
> >>>> On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada
> >>> <[email protected]> wrote:
> >>>>>
> >>>>> On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada
> >>>>> <[email protected]> wrote:
> >>>>>>> Right. But the following scenario seems possible:
> >>>>>>>
> >>>>>>> 1. Both processes have a slot with effective_catalog_xmin = 100.
> >>>>>>> 2. Process-A updates effective_catalog_xmin to 150, and computes the
> >>>>>>> new catalog_xmin as 100 because process-B slot still has
> >>>>>>> effective_catalog_xmin = 100.
> >>>>>>> 3. Process-B updates effective_catalog_xmin to 150, and computes the
> >>>>>>> new catalog_xmin as 150.
> >>>>>>> 4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
> >>>>>>> 5. Process-A updates procArray->replication_slot_catalog_xmin to 100.
> >>>>>>
> >>>>>> I think this scenario can occur, but it is not harmful, because the
> >>>>>> catalog rows removed prior to xid:150 would no longer be used: both
> >>>>>> slots have advanced their catalog_xmin and flushed the value to disk.
> >>>>>> Therefore, even if replication_slot_catalog_xmin regresses, it should
> >>>>>> be OK.
> >>>>>>
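> >>>>>> To make the interleaving concrete, here is a rough standalone sketch
> >>>>>> (plain C with made-up names, not the actual slot.c code) that replays
> >>>>>> the five steps above; the shared value ends up at 100 even though
> >>>>>> both slots already hold 150:
> >>>>>>
> >>>>>> ```
> >>>>>> #include <stdio.h>
> >>>>>>
> >>>>>> /* Toy model: two slot xmins and the shared ProcArray value. */
> >>>>>> static unsigned int slot_a_xmin = 100;
> >>>>>> static unsigned int slot_b_xmin = 100;
> >>>>>> static unsigned int shared_catalog_xmin = 100;
> >>>>>>
> >>>>>> /* Minimum over all slots, as each process computes it. */
> >>>>>> static unsigned int
> >>>>>> compute_min(void)
> >>>>>> {
> >>>>>>     return slot_a_xmin < slot_b_xmin ? slot_a_xmin : slot_b_xmin;
> >>>>>> }
> >>>>>>
> >>>>>> int
> >>>>>> main(void)
> >>>>>> {
> >>>>>>     unsigned int a_result;
> >>>>>>     unsigned int b_result;
> >>>>>>
> >>>>>>     /* 2. Process-A advances its slot; process-B is still at 100. */
> >>>>>>     slot_a_xmin = 150;
> >>>>>>     a_result = compute_min();       /* computes 100 */
> >>>>>>
> >>>>>>     /* 3. Process-B advances its slot; both are now at 150. */
> >>>>>>     slot_b_xmin = 150;
> >>>>>>     b_result = compute_min();       /* computes 150 */
> >>>>>>
> >>>>>>     /* 4. Process-B publishes its result first ... */
> >>>>>>     shared_catalog_xmin = b_result;
> >>>>>>
> >>>>>>     /* 5. ... then Process-A overwrites it with its stale result. */
> >>>>>>     shared_catalog_xmin = a_result;
> >>>>>>
> >>>>>>     printf("shared catalog_xmin = %u, both slots at %u\n",
> >>>>>>            shared_catalog_xmin, slot_a_xmin);
> >>>>>>     return 0;
> >>>>>> }
> >>>>>> ```
> >>>>>>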
> >>>>>> Considering all of the above, I think allowing concurrent xmin
> >>>>>> computation, as the patch does, is acceptable. What do you think?
> >>>>>
> >>>>> I agree with your analysis. Another thing I'd like to confirm: in an
> >>>>> extreme case, if the server crashes suddenly after removing catalog
> >>>>> tuples older than XID 100 and logical decoding restarts, could it end
> >>>>> up missing necessary catalog tuples? I think it's not a problem as
> >>>>> long as the subscriber knows the next commit LSN it wants, but could
> >>>>> it be problematic if the user switches to the logical decoding SQL
> >>>>> API? I might be worrying too much, though.
> >>>>
> >>>> I think this case is not a problem because:
> >>>>
> >>>> In LogicalConfirmReceivedLocation, the updated restart_lsn and
> >>>> catalog_xmin are flushed to disk before the effective_catalog_xmin is
> >>>> updated. Thus, once replication_slot_catalog_xmin advances to a
> >>>> certain value, even in the event of a crash, users won't encounter any
> >>>> removed tuples when consuming from slots after a restart. This is
> >>>> because all slots have their updated restart_lsn flushed to disk,
> >>>> ensuring that upon restarting, changes are decoded from the updated
> >>>> position where older catalog tuples are no longer needed.
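> >>>>
> >>>> As a rough sketch of that ordering (stand-in names and types, not the
> >>>> real LogicalConfirmReceivedLocation code):
> >>>>
> >>>> ```
> >>>> #include <stdio.h>
> >>>>
> >>>> /* Toy slot; the fields loosely mirror the real ones. */
> >>>> typedef struct
> >>>> {
> >>>>     unsigned int restart_lsn;            /* persistent copy */
> >>>>     unsigned int catalog_xmin;           /* persistent copy */
> >>>>     unsigned int effective_catalog_xmin; /* horizon that VACUUM honors */
> >>>> } ToySlot;
> >>>>
> >>>> /* Stand-in for ReplicationSlotSave(): pretend to write and fsync. */
> >>>> static void
> >>>> flush_slot_to_disk(const ToySlot *slot)
> >>>> {
> >>>>     printf("flushed restart_lsn=%u catalog_xmin=%u\n",
> >>>>            slot->restart_lsn, slot->catalog_xmin);
> >>>> }
> >>>>
> >>>> static void
> >>>> confirm_received_location(ToySlot *slot, unsigned int lsn, unsigned int xmin)
> >>>> {
> >>>>     /* Persist the new restart point and catalog_xmin first ... */
> >>>>     slot->restart_lsn = lsn;
> >>>>     slot->catalog_xmin = xmin;
> >>>>     flush_slot_to_disk(slot);
> >>>>
> >>>>     /*
> >>>>      * ... and only then advertise the new horizon.  By the time VACUUM
> >>>>      * may remove catalog tuples older than xmin, the on-disk slot
> >>>>      * already guarantees that decoding restarts from a position that
> >>>>      * no longer needs them, even after a crash.
> >>>>      */
> >>>>     slot->effective_catalog_xmin = xmin;
> >>>> }
> >>>>
> >>>> int
> >>>> main(void)
> >>>> {
> >>>>     ToySlot     slot = {10, 100, 100};
> >>>>
> >>>>     confirm_received_location(&slot, 20, 150);
> >>>>     printf("effective_catalog_xmin = %u\n", slot.effective_catalog_xmin);
> >>>>     return 0;
> >>>> }
> >>>> ```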
> >>>
> >>> Agreed.
> >>>
> >>>>
> >>>> BTW, I assume you meant catalog tuples older than XID 150 are removed,
> >>>> since in the previous example, tuples older than XID 100 are already
> >>>> not useful.
> >>>
> >>> Right. Thank you for pointing this out.
> >>>
> >>> I think we can proceed with the idea proposed by Hou-san. Regarding the
> >>> patch, while it mostly looks good, it might be worth considering adding
> >>> more comments:
> >>>
> >>> - If the caller passes already_locked=true to
> >>> ReplicationSlotsComputeRequiredXmin(), the lock order should also be
> >>> considered (ReplicationSlotControlLock must be acquired before
> >>> ProcArrayLock), as sketched below.
> >>> - ReplicationSlotsComputeRequiredXmin() can be run concurrently by
> >>> multiple processes, which can temporarily move
> >>> procArray->replication_slot_catalog_xmin backward, but that is harmless
> >>> because a smaller catalog_xmin is conservative: it merely prevents VACUUM
> >>> from removing catalog tuples that could otherwise be pruned. It does not
> >>> lead to premature deletion of required data.
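> >>>
> >>> For the first point, a self-contained toy of the ordering rule (plain
> >>> pthread mutexes standing in for the two LWLocks; this is not the
> >>> patch's code): as long as every caller takes the "slot control" lock
> >>> before the "proc array" lock, two concurrent xmin computations cannot
> >>> deadlock on this pair; reversing the order in just one caller is what
> >>> could.
> >>>
> >>> ```
> >>> #include <pthread.h>
> >>> #include <stdio.h>
> >>>
> >>> /* Stand-ins for ReplicationSlotControlLock and ProcArrayLock. */
> >>> static pthread_mutex_t slot_control_lock = PTHREAD_MUTEX_INITIALIZER;
> >>> static pthread_mutex_t proc_array_lock = PTHREAD_MUTEX_INITIALIZER;
> >>> static unsigned int shared_catalog_xmin = 100;
> >>>
> >>> /*
> >>>  * Every caller acquires the locks in the same order: slot control
> >>>  * first, then proc array.  Swapping the order in one caller while
> >>>  * another keeps this order is what could deadlock.
> >>>  */
> >>> static void *
> >>> compute_required_xmin(void *arg)
> >>> {
> >>>     unsigned int new_xmin = *(unsigned int *) arg;
> >>>
> >>>     pthread_mutex_lock(&slot_control_lock);
> >>>     pthread_mutex_lock(&proc_array_lock);
> >>>     shared_catalog_xmin = new_xmin;
> >>>     pthread_mutex_unlock(&proc_array_lock);
> >>>     pthread_mutex_unlock(&slot_control_lock);
> >>>     return NULL;
> >>> }
> >>>
> >>> int
> >>> main(void)
> >>> {
> >>>     pthread_t   a, b;
> >>>     unsigned int xa = 150, xb = 150;
> >>>
> >>>     pthread_create(&a, NULL, compute_required_xmin, &xa);
> >>>     pthread_create(&b, NULL, compute_required_xmin, &xb);
> >>>     pthread_join(a, NULL);
> >>>     pthread_join(b, NULL);
> >>>     printf("catalog_xmin = %u\n", shared_catalog_xmin);
> >>>     return 0;
> >>> }
> >>> ```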
> >>
> >> Thanks for the comments. I added some more comments as suggested.
> >>
> >> Here is the updated patch.
> >
> > Thank you for updating the patch! The patch looks good to me.
> >
> > I've made minor changes to the comment and commit message and created
> > patches for backbranches. I'm going to push them, barring any
> > objections.
> >
> > Regards,
> >
> > --
> > Masahiko Sawada
> > Amazon Web Services: https://aws.amazon.com
> > <master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
> > <REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
> > <REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
> > <REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
> > <REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
> > <REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
> > <REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
>
>
> I’ve just looked through the patch for master. The fix itself looks solid to 
> me. I only noticed a few minor comment nits:
>
> 1
> ```
> + * ProcArrayLock, to prevent any undetectable deadlocks since this function
> + * acquire them in that order.
> ```
>
> acquire -> acquires
>
> 2
> ```
> > +        * values, so no backend update the initial xmin for newly created slot
> ```
>
> > update -> updates
>
> 3
> ```
> +        * slot machinery about the new limit. Once that's done the both locks
> ```
>
> > “the both locks”: it feels like “the” is not needed.
>

Thank you for reviewing the patches! I'll incorporate these comments
before pushing.

I totally overlooked the fact that we no longer support PG13, so I'm
going to push them down to 14.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

