On Mon, Jun 23, 2025 at 8:04 PM vignesh C <vignes...@gmail.com> wrote:
>
> On Mon, 23 Jun 2025 at 04:36, Alexander Korotkov <aekorot...@gmail.com> wrote:
> >
> > On Fri, Jun 20, 2025 at 2:24 PM vignesh C <vignes...@gmail.com> wrote:
> > > On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorot...@gmail.com> 
> > > wrote:
> > > > Dear Kuroda-san,
> > > >
> > > > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
> > > > <kuroda.hay...@fujitsu.com> wrote:
> > > > > > > Regarding the assertion failure, I've found that the assert in
> > > > > > > PhysicalConfirmReceivedLocation() conflicts with the restart_lsn
> > > > > > > previously set by ReplicationSlotReserveWal().  As far as I can see,
> > > > > > > ReplicationSlotReserveWal() just picks the fresh XLogCtl->RedoRecPtr
> > > > > > > lsn.  So, it doesn't seem there is a guarantee that restart_lsn
> > > > > > > never goes backward.  The comment in ReplicationSlotReserveWal()
> > > > > > > even states there is a "chance that we have to retry".
> > > > > > >
> > > > > >
> > > > > > I don't see how this theory can lead to a restart_lsn of a slot 
> > > > > > going
> > > > > > backwards. The retry mentioned there is just a retry to reserve the
> > > > > > slot's position again if the required WAL is already removed. Such a
> > > > > > retry can only get the position later than the previous restart_lsn.
> > > > >
> > > > > We analyzed the assertion failure that happened in
> > > > > pg_basebackup/020_pg_receivewal, and confirmed that restart_lsn can go
> > > > > backward. This means that the Assert() added by ca307d5 can cause a
> > > > > crash.
> > > > >
> > > > > Background
> > > > > ===========
> > > > > When pg_receivewal starts replication using a replication slot, it sets
> > > > > the startpoint to the beginning of the segment that contains restart_lsn.
> > > > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests
> > > > > WAL starting from 0/B00000.
> > > > > For more detail on this behavior, see f61e1dd2 and d9bae531.
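> > > > >
> > > > > The adjustment is roughly the following (a paraphrased sketch of the
> > > > > startpoint rounding in pg_receivewal, not the exact source; WalSegSz
> > > > > is the configured wal_segment_size):
> > > > >
> > > > >     /* Always start streaming at the beginning of a segment */
> > > > >     stream.startpos -= XLogSegmentOffset(stream.startpos, WalSegSz);
> > > > >
> > > > > With the example values, 0/B000D0 rounds down to 0/B00000, assuming a
> > > > > wal_segment_size for which 0/B00000 is a segment boundary (e.g. 1MB
> > > > > segments).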
> > > > >
> > > > > What happened here
> > > > > ==================
> > > > > Based on the above, the walsender started sending from the beginning of
> > > > > the segment (0/B00000). When the walreceiver received it, it tried to
> > > > > send a reply; at that time the flushed WAL location was 0/B00000. The
> > > > > walsender then set the received LSN as restart_lsn in
> > > > > PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward
> > > > > (0/B000D0 -> 0/B00000).
> > > > >
> > > > > The assertion failure could happen if a CHECKPOINT happened at that
> > > > > time. The slot's last_saved_restart_lsn was 0/B000D0, but its
> > > > > data.restart_lsn was 0/B00000, which could not satisfy the assertion
> > > > > added in InvalidatePossiblyObsoleteSlot().
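> > > > >
> > > > > Roughly, the violated invariant is that a slot's in-memory restart_lsn
> > > > > should never be behind the value last saved to disk (a simplified
> > > > > sketch of the check, not the exact assertion added by ca307d5):
> > > > >
> > > > >     /* data.restart_lsn must not move behind what a checkpoint saved */
> > > > >     Assert(s->last_saved_restart_lsn == InvalidXLogRecPtr ||
> > > > >            s->data.restart_lsn >= s->last_saved_restart_lsn);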
> > > >
> > > > Thank you for your detailed explanation!
> > > >
> > > > > Note
> > > > > ====
> > > > > 1.
> > > > > In this case, starting from the beginning of the segment is not a
> > > > > problem, because the checkpoint process only removes WAL files for
> > > > > segments that precede the segment containing restart_lsn. The current
> > > > > segment (0/B00000) will not be removed, so there is no risk of data
> > > > > loss or inconsistency.
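> > > > >
> > > > > For reference, the removal boundary works per segment, roughly like
> > > > > this (illustration only; min_restart_lsn, cutoff_segno, candidate_segno
> > > > > and remove_segment are made-up names, not the checkpointer's actual
> > > > > variables):
> > > > >
> > > > >     /* a segment may be removed only if it precedes the segment
> > > > >      * containing the slots' minimum restart_lsn */
> > > > >     XLByteToSeg(min_restart_lsn, cutoff_segno, wal_segment_size);
> > > > >     if (candidate_segno < cutoff_segno)
> > > > >         remove_segment(candidate_segno);    /* hypothetical helper */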
> > > > >
> > > > > 2.
> > > > > A similar pattern applies to pg_basebackup. Both use logic that adjusts
> > > > > the requested streaming position to the start of the segment, and both
> > > > > report the received LSN as flushed.
> > > > >
> > > > > 3.
> > > > > I considered the theory above, but I could not reproduce the
> > > > > 040_standby_failover_slots_sync failure because it is a timing issue.
> > > > > Has anyone else reproduced it?
> > > > >
> > > > > We are still investigating the failure in
> > > > > 040_standby_failover_slots_sync.
> > > >
> > > > I didn't manage to reproduce this.  But as I can see from the logs [1]
> > > > on mamba, a START_REPLICATION command was issued just before the assert
> > > > trap.  Could it be something similar to what I described in [2]?
> > > > Namely:
> > > > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> > > > 2. A concurrent checkpoint flushes that restart_lsn to disk.
> > > > 3. PhysicalConfirmReceivedLocation() sets the slot's restart_lsn to
> > > > the beginning of the segment.
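> > > >
> > > > For step 1, the reservation of a physical slot roughly does the
> > > > following (a simplified sketch, not the exact source):
> > > >
> > > >     if (SlotIsPhysical(slot))
> > > >         restart_lsn = GetRedoRecPtr();   /* fresh XLogCtl->RedoRecPtr */
> > > >     slot->data.restart_lsn = restart_lsn;
> > > >
> > > > If a checkpoint persists that value (step 2) before the first standby
> > > > reply rewinds it to the segment start (step 3), the on-disk copy ends up
> > > > ahead of the in-memory one, which is exactly what the assert rejects.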
> > >
> > > Here is my analysis of the 040_standby_failover_slots_sync test
> > > failure, where the physical replication slot can point to a backward
> > > LSN: in certain rare cases the restart LSN can go backwards. This
> > > scenario can be reproduced by running checkpoints continuously and
> > > slowing down WAL apply. I have a patch with changes to handle this.
> > > Here's a summary of the sequence of events:
> > > 1) The standby confirms a new LSN (0/40369C8) when the primary sends
> > > some WAL contents:
> > > After the standby writes the received WAL contents in XLogWalRcvWrite,
> > > it sends this LSN (0/40369C8) as the confirmed flush location.
> > > The primary acknowledges this and updates the replication slot's
> > > restart_lsn accordingly:
> > > 2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
> > > PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
> > > restart_lsn to 0/40369C8
> > > 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
> > > START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
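> > >
> > > For reference, the reply carries three pointers, roughly as below (a
> > > simplified sketch of XLogWalRcvSendReply(), not the exact source):
> > >
> > >     writePtr = LogstreamResult.Write;      /* written to disk */
> > >     flushPtr = LogstreamResult.Flush;      /* flushed to disk */
> > >     applyPtr = GetXLogReplayRecPtr(NULL);  /* replayed so far */
> > >
> > > Here the flush pointer reflects WAL that was just written and flushed,
> > > which is ahead of the replay pointer because apply is slowed down.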
> > >
> > > 2) Shortly after receiving the new LSN, a checkpoint occurs which
> > > saves this restart_lsn:
> > > 2025-06-20 14:33:21.780 IST [134913] LOG:  checkpoint complete: wrote
> > > 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0
> > > removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync
> > > files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0
> > > kB; lsn=0/4036A20, redo lsn=0/40369C8
> > >
> > > 3) Streaming replication is restarted because of a primary_conninfo GUC
> > > change and reload.
> > > The WAL receiver process is restarted:
> > > 25-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver
> > > process due to administrator command
> > >
> > > 4) The standby sends an older flush pointer after restart:
> > > Upon restart, the WAL receiver sends a flush location (0/401D448)
> > > derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older
> > > than the previously confirmed restart_lsn. It is important to note
> > > that in this case we are sending lastReplayedEndRecPtr, which is the
> > > last successfully replayed LSN:
> > > 5-06-20 14:33:21.796 IST [135135] LOG:  WalReceiverMain
> > > LogstreamResult.Flush initialized to 0/401D448
> > > 2025-06-20 14:33:21.796 IST [135135] LOG:  sending write 0/401D448
> > > flush 0/401D448 apply 0/401D448
> > >
> > > This is done from here:
> > > ....
> > > /* Initialize LogstreamResult and buffers for processing messages */
> > > LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
> > >
> > > initStringInfo(&reply_message);
> > >
> > > /* Initialize nap wakeup times. */
> > > now = GetCurrentTimestamp();
> > > for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> > > WalRcvComputeNextWakeup(i, now);
> > >
> > > /* Send initial reply/feedback messages. */
> > > XLogWalRcvSendReply(true, false);
> > > ...
> > >
> > > In step 1, we send the LSN of the WAL that has been written; since we
> > > have slowed down WAL apply, the replay location lags behind.
> > >
> > > 5) I have added logs to detect this inconsistency:
> > > This leads to a scenario where the standby tries to confirm a
> > > restart_lsn older than the one currently held by the primary:
> > > 2025-06-20 14:33:21.797 IST [135136] standby1 LOG:  crash scenario -
> > > slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older
> > > than the current one (0/40369C8)
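> > >
> > > The check behind that log line is roughly the following (a hypothetical
> > > sketch of the logging added by the repro patch, placed in
> > > PhysicalConfirmReceivedLocation(); not the actual patch hunk):
> > >
> > >     if (lsn < slot->data.restart_lsn)
> > >         elog(LOG, "crash scenario - slot %s, cannot confirm a restart LSN (%X/%X) that is older than the current one (%X/%X)",
> > >              NameStr(slot->data.name),
> > >              LSN_FORMAT_ARGS(lsn),
> > >              LSN_FORMAT_ARGS(slot->data.restart_lsn));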
> > >
> > > If a checkpoint happens concurrently in this state, it may trigger an
> > > assertion failure on the primary because the restart_lsn is less than
> > > the last_saved_restart_lsn.
> > > Currently this does not break physical replication, but I'm not sure
> > > how it will behave if the gap grows to many WAL files and those WAL
> > > files get deleted.
> > >
> > > Attached is the restart_lsn_backup_repro_v1 patch with which you can
> > > reproduce the 040_standby_failover_slots_sync failure; grep for
> > > "crash scenario - slot sb1_slot" in the logs. For me it occurs on
> > > every run. The reproduced logs
> > > 040_standby_failover_slots_sync_publisher and
> > > 040_standby_failover_slots_sync_standby1.log are attached.
> > >
> > > This proves that the restart_lsn can go backward in cases where the
> > > standby is applying slowly. But since this has nothing to do with this
> > > thread, I feel you can commit the assert removal patch. I will
> > > continue the analysis further to see whether there is any impact on
> > > physical replication, and we can later add comments accordingly.
> >
> > I think this thread now has well-documented cases where restart_lsn
> > goes backward.  I suppose the first thing we should do is to remove
> > the assert to make the buildfarm stop failing on this.  I've prepared a
> > patch with a revised commit message.  Then we can focus on documenting
> > this correctly in the code comments.
>
> +1 to proceed this way. One minor suggestion: we could also mention
> pg_basebackup in the commit message, as I noticed this can happen
> with pg_basebackup too.

Thank you!  Pushed including your suggestion to mention pg_basebackup.

------
Regards,
Alexander Korotkov
Supabase

