Re: Allow reading LSN written by walreciever, but not flushed yet

Andrey Borodin Wed, 21 May 2025 08:38:41 -0700

> On 21 May 2025, at 15:03, Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
> 
> 
> 
> On 2025/05/21 17:35, Andrey Borodin wrote:
>> Well, we implemented this and made tests that do a lot of failovers. These 
>> tests observed data loss in some infrequent cases due to wrong new primary 
>> selection. Because "few seconds" is actually unknown random time.
> 
> I see your point. But doesn't a similar issue exist even with the write LSN?
> For example, even if node1's write LSN is ahead of node2's at one moment,
> node2 might catch up or surpass it a few seconds later.
> 
> If the walreceiver is no longer running, we can assume the write LSN has
> reached its final value. So by waiting for the walreceiver to exit on both
> nodes, we can "safely" compare their write LSNs to decide which one is ahead.
> Also, in this situation, since XLogWalRcvFlush() is called during WalRcvDie(),
> the flush LSN seems effectively guaranteed to match the write LSN.
> So it seems also safe to use the flush LSN.

You are right. The receive LSN is meaningless while receiving is still in
progress, so the only way to know the final receive LSN is to stop receiving...
I need to think more about it.
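
To make the rule above concrete, here is a minimal sketch in Python, not code from PostgreSQL or any HA tool. The StandbyState record and pick_most_advanced() are hypothetical names; the only PostgreSQL fact assumed is the textual LSN format (two hexadecimal numbers, high and low 32 bits, separated by a slash). The helper refuses to choose while any walreceiver is still running, because the LSNs may still move.

```python
from dataclasses import dataclass
from typing import Optional


def parse_lsn(text: str) -> int:
    """Convert the 'XXXXXXXX/YYYYYYYY' LSN text form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass
class StandbyState:
    """Hypothetical record that an HA tool would fill in per standby."""
    name: str
    walreceiver_running: bool
    lsn: str  # write or flush LSN as text


def pick_most_advanced(standbys: list[StandbyState]) -> Optional[StandbyState]:
    """Return the standby with the highest LSN, or None if any LSN can still move."""
    if not standbys or any(s.walreceiver_running for s in standbys):
        return None  # comparison is not yet safe
    # Compare numerically: comparing the text forms as strings gives wrong answers.
    return max(standbys, key=lambda s: parse_lsn(s.lsn))
```

LSNs are compared as integers here because a plain string comparison would rank "0/9000000" above "0/10000000".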

>>>>>> Caveat: we already have a function pg_last_wal_receive_lsn(), which in 
>>>>>> fact returns flushed LSN, not written. I propose to add a new function 
>>>>>> which returns LSN actually written. Internals of this function are 
>>>>>> already implemented (GetWalRcvWriteRecPtr()), but unused.
>>> 
>>> GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
>>> when the walreceiver restarts. This behavior is OK for your purpose?
>> It is OK, because:
>> 1. It's strictly no worse than flushed LSN
> 
> Could you clarify this?
> 
> XLogWalRcvFlush() only updates flushedUpto if LogstreamResult.Flush has
> advanced, while XLogWalRcvWrite() updates writtenUpto unconditionally.
> That means the flush LSN (as reported by pg_last_wal_receive_lsn()) never
> moves backward, whereas the write LSN might.

The write LSN cannot move backward past the flush LSN. Receive LSN >= flush LSN.

> Because of this difference in behavior, I was thinking that
> we might need to track the maximum write LSN seen so far and have the function
> return that value.

That would be ideal. Or maybe just the maximum LSN that we have told the
primary we received...
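
The "maximum write LSN seen so far" idea can be sketched as a high-water mark. This is a toy model in Python, not PostgreSQL code: WriteLsnTracker and its fields are hypothetical, and written_upto only imitates the unconditional update of writtenUpto described in the quoted text above.

```python
class WriteLsnTracker:
    """Toy model: written_upto may move backward, max_written never does."""

    def __init__(self) -> None:
        self.written_upto = 0  # last write position, overwritten on every write
        self.max_written = 0   # highest write position ever observed

    def on_write(self, lsn: int) -> None:
        # Unconditional update: after a restart this can be lower than before.
        self.written_upto = lsn
        # High-water mark: only ever advances.
        self.max_written = max(self.max_written, lsn)
```

In this model, a write at position 500 followed by a restart that resumes at 300 leaves written_upto at 300 while max_written stays at 500, which is the value a reporting function would return.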


Best regards, Andrey Borodin.