On Thu, Dec 18, 2025 at 7:19 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
>
> On Friday, December 19, 2025 3:42 AM Masahiko Sawada <[email protected]> wrote:
> >
> > On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> > >
> > > On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada <[email protected]> wrote:
> > > >
> > > > On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> > > > >
> > > > > On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada <[email protected]> wrote:
> > > > > > Right. But the following scenario seems to happen:
> > > > > >
> > > > > > 1. Both processes have a slot with effective_catalog_xmin = 100.
> > > > > > 2. Process-A updates effective_catalog_xmin to 150, and computes the new
> > > > > >    catalog_xmin as 100 because process-B's slot still has
> > > > > >    effective_catalog_xmin = 100.
> > > > > > 3. Process-B updates effective_catalog_xmin to 150, and computes the new
> > > > > >    catalog_xmin as 150.
> > > > > > 4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
> > > > > > 5. Process-A updates procArray->replication_slot_catalog_xmin to 100.
> > > > >
> > > > > I think this scenario can occur, but it is not harmful: the catalog rows
> > > > > removed prior to xid:150 would no longer be used, as both slots have
> > > > > advanced their catalog_xmin and flushed the value to disk. Therefore,
> > > > > even if replication_slot_catalog_xmin regresses, it should be OK.
> > > > >
> > > > > Considering all of the above, I think allowing concurrent xmin
> > > > > computation, as the patch does, is acceptable. What do you think?
> > > >
> > > > I agree with your analysis. Another thing I'd like to confirm is that,
> > > > in an extreme case, if the server crashes suddenly after removing catalog
> > > > tuples older than XID 100 and logical decoding restarts, does it end up
> > > > missing necessary catalog tuples? I think it's not a problem as long as
> > > > the subscriber knows the next commit LSN it wants, but could it be
> > > > problematic if the user switches to using the logical decoding SQL API?
> > > > I might be worrying too much, though.
> > >
> > > I think this case is not a problem because:
> > >
> > > In LogicalConfirmReceivedLocation, the updated restart_lsn and
> > > catalog_xmin are flushed to disk before the effective_catalog_xmin is
> > > updated. Thus, once replication_slot_catalog_xmin advances to a certain
> > > value, even in the event of a crash, users won't encounter any removed
> > > tuples when consuming from slots after a restart. This is because all
> > > slots have their updated restart_lsn flushed to disk, ensuring that upon
> > > restart, changes are decoded from the updated position, where the older
> > > catalog tuples are no longer needed.
> >
> > Agreed.
> >
> > > BTW, I assume you meant catalog tuples older than XID 150 are removed,
> > > since in the previous example, tuples older than XID 100 are already not
> > > useful.
> >
> > Right. Thank you for pointing this out.
> >
> > I think we can proceed with the idea proposed by Hou-san. Regarding the
> > patch, while it mostly looks good, it might be worth considering adding
> > more comments:
> >
> > - If the caller passes already_locked=true to
> >   ReplicationSlotsComputeRequiredXmin(), the lock order should also be
> >   considered (the caller must acquire ReplicationSlotControlLock before
> >   ProcArrayLock).
> > - ReplicationSlotsComputeRequiredXmin() can be run concurrently by
> >   multiple processes, which can temporarily move
> >   procArray->replication_slot_catalog_xmin backward. This is harmless
> >   because a smaller catalog_xmin is conservative: it merely prevents
> >   VACUUM from removing catalog tuples that could otherwise be pruned; it
> >   does not lead to premature deletion of required data.
>
> Thanks for the comments. I added some more comments as suggested.
>
> Here is the updated patch.
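For readers following the thread, the window described in steps 1-5 (and in
the second review comment above) is easiest to see in a sketch. The
following is a simplified paraphrase of ReplicationSlotsComputeRequiredXmin()
for illustration only, not the exact slot.c code: the per-slot spinlock, the
data xmin computation, and the already_locked path are all omitted.

/*
 * Simplified paraphrase of ReplicationSlotsComputeRequiredXmin(), showing
 * only the catalog_xmin side.  For illustration, not the actual code.
 */
static void
ComputeRequiredCatalogXminSketch(void)
{
	TransactionId agg_catalog_xmin = InvalidTransactionId;
	int			i;

	/* Compute the minimum effective_catalog_xmin across all in-use slots. */
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

		if (!s->in_use)
			continue;

		if (TransactionIdIsValid(s->effective_catalog_xmin) &&
			(!TransactionIdIsValid(agg_catalog_xmin) ||
			 TransactionIdPrecedes(s->effective_catalog_xmin,
								   agg_catalog_xmin)))
			agg_catalog_xmin = s->effective_catalog_xmin;
	}
	LWLockRelease(ReplicationSlotControlLock);

	/*
	 * The window: between releasing the lock above and publishing below,
	 * another backend can compute a newer minimum and publish it first, so
	 * our older value then overwrites it (steps 4-5 above).  As discussed,
	 * this regression is conservative: it only delays pruning of catalog
	 * tuples, it never allows pruning tuples that are still needed.
	 */
	ProcArraySetReplicationSlotXmin(InvalidTransactionId, agg_catalog_xmin,
									false);
}

The minimum is computed under ReplicationSlotControlLock but published
separately under ProcArrayLock (inside ProcArraySetReplicationSlotXmin()),
which is why two backends can publish their results in the opposite order
from the one in which they computed them.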
Thank you for updating the patch! The patch looks good to me. I've made
minor changes to the comment and commit message and created patches for the
back branches. I'm going to push them, barring any objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
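As a companion to the crash-safety reasoning quoted earlier in the thread,
the ordering in LogicalConfirmReceivedLocation() that Hou-san refers to
looks roughly like the sketch below. It is abridged for illustration and is
not the exact logical.c code: the candidate-LSN validity checks and the
updated_xmin/updated_restart bookkeeping are omitted.

/*
 * Abridged sketch of the ordering in LogicalConfirmReceivedLocation();
 * candidate-LSN validity checks and related bookkeeping omitted.
 */
void
ConfirmReceivedLocationSketch(XLogRecPtr lsn)
{
	ReplicationSlot *slot = MyReplicationSlot;

	/* Adopt the candidate horizons under the slot's spinlock. */
	SpinLockAcquire(&slot->mutex);
	slot->data.confirmed_flush = lsn;
	slot->data.catalog_xmin = slot->candidate_catalog_xmin; /* simplified */
	slot->data.restart_lsn = slot->candidate_restart_lsn;	/* simplified */
	SpinLockRelease(&slot->mutex);

	/* (1) Persist the new restart_lsn and catalog_xmin to disk first ... */
	ReplicationSlotMarkDirty();
	ReplicationSlotSave();

	/*
	 * (2) ... and only then publish the new horizon.  If we crash between
	 * (1) and (2), the global replication_slot_catalog_xmin has not yet
	 * advanced, and the on-disk slot already points at a restart position
	 * that no longer needs the older catalog tuples.  Either way, a
	 * restarted decoding session never misses catalog tuples it requires.
	 */
	SpinLockAcquire(&slot->mutex);
	slot->effective_catalog_xmin = slot->data.catalog_xmin;
	SpinLockRelease(&slot->mutex);

	ReplicationSlotsComputeRequiredXmin(false);
	ReplicationSlotsComputeRequiredLSN();
}

Because the flush in (1) happens before the publish in (2), the global
horizon can only advance once every slot's new restart_lsn and catalog_xmin
are safely on disk, which is the invariant the discussion above relies on.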
Attachments:
- master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch
- REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch
- REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch
- REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch
- REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch
- REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch
- REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch
