On Mon, Dec 29, 2025 at 7:55 PM Chao Li <[email protected]> wrote:
>
> On Dec 30, 2025, at 06:14, Masahiko Sawada <[email protected]> wrote:
> >
> > On Thu, Dec 18, 2025 at 7:19 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> >>
> >> On Friday, December 19, 2025 3:42 AM Masahiko Sawada <[email protected]> wrote:
> >>>
> >>> On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> >>>>
> >>>> On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada <[email protected]> wrote:
> >>>>>
> >>>>> On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> >>>>>>
> >>>>>> On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada <[email protected]> wrote:
> >>>>>>> Right. But the following scenario seems to happen:
> >>>>>>>
> >>>>>>> 1. Both processes have a slot with effective_catalog_xmin = 100.
> >>>>>>> 2. Process-A updates effective_catalog_xmin to 150, and computes the new catalog_xmin as 100 because process-B slot still has effective_catalog_xmin = 100.
> >>>>>>> 3. Process-B updates effective_catalog_xmin to 150, and computes the new catalog_xmin as 150.
> >>>>>>> 4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
> >>>>>>> 5. Process-A updates procArray->replication_slot_catalog_xmin to 100.
> >>>>>>
> >>>>>> I think this scenario can occur, but is not harmful. Because the catalog rows removed prior to xid:150 would no longer be used, as both slots have advanced their catalog_xmin and flushed the value to disk. Therefore, even if replication_slot_catalog_xmin regresses, it should be OK.
> >>>>>>
> >>>>>> Considering all above, I think allowing concurrent xmin computation, as the patch does, is acceptable. What do you think ?
> >>>>>
> >>>>> I agree with your analysis. Another thing I'd like to confirm is that in an extreme case, if the server crashes suddenly after removing catalog tuples older than XID 100 and logical decoding restarts, it ends up missing necessary catalog tuples? I think it's not a problem as long as the subscriber knows the next commit LSN they want but could it be problematic if the user switches to use the logical decoding SQL API? I might be worrying too much, though.
> >>>>
> >>>> I think this case is not a problem because:
> >>>>
> >>>> In LogicalConfirmReceivedLocation, the updated restart_lsn and catalog_xmin are flushed to disk before the effective_catalog_xmin is updated. Thus, once replication_slot_catalog_xmin advances to a certain value, even in the event of a crash, users won't encounter any removed tuples when consuming from slots after a restart. This is because all slots have their updated restart_lsn flushed to disk, ensuring that upon restarting, changes are decoded from the updated position where older catalog tuples are no longer needed.
> >>>
> >>> Agreed.
> >>>
> >>>> BTW, I assume you meant catalog tuples older than XID 150 are removed, since in the previous example, tuples older than XID 100 are already not useful.
> >>>
> >>> Right. Thank you for pointing this out.
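
To make the interleaving in steps 1-5 quoted above concrete, here is a toy, standalone C sketch. It is not PostgreSQL code: the two slot variables, the shared value, and the helper are invented stand-ins for the slots' effective_catalog_xmin, ReplicationSlotsComputeRequiredXmin(), and procArray->replication_slot_catalog_xmin, and it simply replays the five steps in order. It shows the published value moving backward from 150 to 100 while staying at or below both slots' effective_catalog_xmin, which is why the regression is merely conservative:

```
/*
 * Toy illustration only -- not PostgreSQL code.  Two "slots" and a shared
 * value stand in for effective_catalog_xmin and
 * procArray->replication_slot_catalog_xmin; the five steps from the
 * scenario above are replayed sequentially.
 */
#include <stdio.h>

typedef unsigned int TransactionId;	/* stand-in for the real uint32 type */

static TransactionId slot_a_xmin = 100;	/* slot of process A */
static TransactionId slot_b_xmin = 100;	/* slot of process B */
static TransactionId shared_catalog_xmin = 100;	/* published aggregate */

/* Aggregate the per-slot values (smallest one wins). */
static TransactionId
compute_aggregate_xmin(void)
{
	return (slot_a_xmin < slot_b_xmin) ? slot_a_xmin : slot_b_xmin;
}

int
main(void)
{
	TransactionId a_result;
	TransactionId b_result;

	/* Step 1: both slots start with effective_catalog_xmin = 100. */

	/* Step 2: A advances its slot and computes 100 (B still shows 100). */
	slot_a_xmin = 150;
	a_result = compute_aggregate_xmin();

	/* Step 3: B advances its slot and computes 150. */
	slot_b_xmin = 150;
	b_result = compute_aggregate_xmin();

	/* Step 4: B publishes 150. */
	shared_catalog_xmin = b_result;
	printf("after B publishes: %u\n", shared_catalog_xmin);

	/* Step 5: A publishes its stale result, moving the value backward. */
	shared_catalog_xmin = a_result;
	printf("after A publishes: %u\n", shared_catalog_xmin);

	/*
	 * The published value (100) is not newer than either slot's xmin, so
	 * VACUUM only retains extra catalog tuples; nothing needed is removed.
	 */
	return 0;
}
```

Running it prints 150 and then 100; the backward move only makes VACUUM keep some catalog tuples a little longer than strictly necessary.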
> >>>
> >>> I think we can proceed with the idea proposed by Hou-san. Regarding the patch, while it mostly looks good, it might be worth considering adding more comments:
> >>>
> >>> - If the caller passes already_locked=true to ReplicationSlotsComputeRequiredXmin(), the lock order should also be considered (must acquire ReplicationSlotControlLock and then ProcArrayLock).
> >>> - ReplicationSlotsComputeRequiredXmin() can be run concurrently by multiple processes, resulting in temporarily moving procArray->replication_slot_catalog_xmin backward, but it's harmless because a smaller catalog_xmin is conservative: it merely prevents VACUUM from removing catalog tuples that could otherwise be pruned. It does not lead to premature deletion of required data.
> >>
> >> Thanks for the comments. I added some more comments as suggested.
> >>
> >> Here is the updated patch.
> >
> > Thank you for updating the patch! The patch looks good to me.
> >
> > I've made minor changes to the comment and commit message and created patches for backbranches. I'm going to push them, barring any objections.
> >
> > Regards,
> >
> > --
> > Masahiko Sawada
> > Amazon Web Services: https://aws.amazon.com
> > <master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>
>
> I’ve just looked through the patch for master. The fix itself looks solid to me. I only noticed a few minor comment nits:
>
> 1
> ```
> + * ProcArrayLock, to prevent any undetectable deadlocks since this function
> + * acquire them in that order.
> ```
>
> acquire -> acquires
>
> 2
> ```
> + * values, so no backend update the initial xmin for newly created slot
> ```
>
> Update -> updates
>
> 3
> ```
> + * slot machinery about the new limit. Once that's done the both locks
> ```
>
> “The both locks”, feels like “the” is not needed.
>
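As for the lock-order note quoted above (and the comment wording in nit 1), here is a small standalone C sketch of the discipline it describes. It is only an analogy, not PostgreSQL code: pthread mutexes stand in for ReplicationSlotControlLock and ProcArrayLock, and the function names are invented for the example. The point is that every path that needs both locks, including a caller that passes already_locked=true after taking them itself, uses the same order, which is what rules out undetected deadlocks:

```
/*
 * Toy analogy only -- not PostgreSQL code.  Two pthread mutexes stand in
 * for ReplicationSlotControlLock and ProcArrayLock; the rule shown is
 * that both are always taken in the same order.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t slot_control_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t proc_array_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int shared_catalog_xmin = 100;

/*
 * Publish a new aggregate value.  If already_locked is true, the caller
 * must already hold BOTH locks, acquired in the same order used below
 * (slot-control lock first, proc-array lock second); otherwise the two
 * code paths could deadlock against each other.
 */
static void
publish_catalog_xmin(bool already_locked, unsigned int candidate)
{
	if (!already_locked)
	{
		pthread_mutex_lock(&slot_control_lock);	/* 1st: slot control */
		pthread_mutex_lock(&proc_array_lock);	/* 2nd: proc array */
	}

	if (candidate > shared_catalog_xmin)
		shared_catalog_xmin = candidate;

	if (!already_locked)
	{
		pthread_mutex_unlock(&proc_array_lock);
		pthread_mutex_unlock(&slot_control_lock);
	}
}

/* A caller that takes the locks itself, in the documented order. */
static void *
worker(void *arg)
{
	unsigned int candidate = *(unsigned int *) arg;

	pthread_mutex_lock(&slot_control_lock);
	pthread_mutex_lock(&proc_array_lock);
	publish_catalog_xmin(true, candidate);
	pthread_mutex_unlock(&proc_array_lock);
	pthread_mutex_unlock(&slot_control_lock);
	return NULL;
}

int
main(void)
{
	pthread_t t1, t2;
	unsigned int c1 = 150, c2 = 140;

	pthread_create(&t1, NULL, worker, &c1);
	pthread_create(&t2, NULL, worker, &c2);
	publish_catalog_xmin(false, 130);	/* caller without the locks held */
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	printf("published catalog_xmin: %u\n", shared_catalog_xmin);
	return 0;
}
```

The essential property is just the fixed order; swapping the two lock calls in only one of the paths is exactly the kind of mistake the suggested comment is meant to prevent.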
Thank you for reviewing the patches! I'll incorporate these comments before pushing. I totally overlooked the fact that we no longer support PG13, so I'm going to push the back-branch patches only down to 14.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
