On Tue, May 13, 2025 at 4:22 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
>
> On Tue, May 13, 2025 at 3:48 PM shveta malik <shveta.ma...@gmail.com> wrote:
> >
> > Hi All,
> >
> > It is a spin-off thread from earlier discussions at [1] and [2].
> >
> > While analyzing the slot-sync BF failure as stated in [1], it was
> > observed that there are chances that confirmed_flush_lsn may move
> > backward depending on the feedback messages received from the
> > downstream system. It was suspected that the backward movement of
> > confirmed_flush_lsn may result in data duplication issues. Earlier we
> > were able to successfully reproduce the issue with two_phase enabled
> > subscriptions (see[2]). Now on further analysing, it seems possible
> > that data duplication issues may happen without two-phase as well.
>
> Thanks for the detailed explanation. Before we focus on patching the
> symptoms, I’d like to explore whether the issue can be addressed on
> the subscriber side. Specifically, have we analyzed if there’s a way
> to prevent the subscriber from moving the LSN backward in the first
> place? That might lead to a cleaner and more robust solution overall.
>

The subscriber doesn't move the LSN backwards; it only reports to the
publisher the latest remote LSN tracked by its replication origin. As
explained in email [1], the subscriber doesn't persistently store or
advance the LSN for changes it doesn't have to apply, such as DDLs or
non-published DMLs. However, it still needs to send confirmation of
such LSNs for synchronous replication. Since those positions are not
persisted, the LSN the subscriber reports later can be older than one
it acknowledged earlier.

This is commented in the code as well; see the comments in
CreateDecodingContext ("It might seem like we should error out in this
case, but it's pretty common for a client to acknowledge a LSN it
doesn't have to do anything for ...").

As mentioned in email [1], persisting LSN information for changes the
subscriber doesn't have to act on could add noticeable performance
overhead. I think it is better to deal with this in the publisher by
not allowing it to move the confirmed_flush LSN backwards, as Shveta
proposed.

[1]: https://www.postgresql.org/message-id/CAA4eK1%2BzWQwOe5G8zCYGvErnaXh5%2BDbyg_A1Z3uywSf_4%3DT0UA%40mail.gmail.com

--
With Regards,
Amit Kapila.