On Tue, Mar 3, 2026 at 1:12 PM Zhijie Hou (Fujitsu) <[email protected]> wrote: > > On Saturday, February 28, 2026 1:03 PM Amit Kapila <[email protected]> > wrote: > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote: > > > > > > Normally, the slotsync worker updates the standby slot using the > > > primary's slot state. However, when confirmed_flush_lsn matches but > > > restart_lsn does not, the worker does not actually update the standby > > > slot. Despite that, the current code of update_local_synced_slot() > > > appears to treat this situation as if an update occurred. As a result, > > > the worker sleeps only for the minimum interval (200 ms) before > > > retrying. In the next cycle, it again assumes an update happened, and > > > continues looping with the short sleep interval, causing the repeated > > > logical decoding log messages. Based on a quick analysis, this seems to be > > the root cause. > > > > > > I think update_local_synced_slot() should return false (i.e., no > > > update > > > happened) when confirmed_flush_lsn is equal but restart_lsn differs > > > between primary and standby. > > > > > > > We expect that in such a case update_local_synced_slot() should advance > > local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(), > > otherwise, it won't go in the cheap code path next time. Normally, > > restart_lsn > > advancement should happen when we process XLOG_RUNNING_XACTS and > > call SnapBuildProcessRunningXacts(). In this particular case as both > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the > > machinery may not be processing XLOG_RUNNING_XACTS record. I have not > > debugged the exact case yet but you can try by emitting some more records > > on publisher, it should let the standby advance the slot. It is possible > > that we > > can do something like you are proposing to silence the LOG messages but we > > should know what is going on here. > > I reproduced and debugged this issue where a replication slot's restart_lsn > fails to advance. In my environment, I found it only occurs when a synced > slot first builds a consistent snapshot. The problematic code path is in > SnapBuildProcessRunningXacts(): > > if (builder->state < SNAPBUILD_CONSISTENT) > { > /* returns false if there's no point in performing cleanup just yet */ > if (!SnapBuildFindSnapshot(builder, lsn, running)) > return; > } > > When a synced slot reaches consistency for the first time with no running > transactions, SnapBuildFindSnapshot() returns false, causing the function to > return without updating the candidate restart_lsn. > > So, an alternative approach is to improve this logic by updating the candidate > restart_lsn in this case instead of returning early. >
But why not return 'true' from SnapBuildFindSnapshot() in that case? The comment atop SnapBuildFindSnapshot() says: "Returns true if there is a point in performing internal maintenance/cleanup using the xl_running_xacts record.". Doesn't updating restart_lsn fall under that category? However, I have a question that even if we haven't incremented it in the first cycle, why is it not incrementing restart_lsn in consecutive sync cycles. -- With Regards, Amit Kapila.
