On 2025/05/21 17:35, Andrey Borodin wrote:
> Well, we implemented this and made tests that do a lot of failovers. These tests
> observed data loss in some infrequent cases due to wrong new-primary selection,
> because the "few seconds" is actually an unknown, random amount of time.

I see your point. But doesn't a similar issue exist even with the write LSN?
For example, even if node1's write LSN is ahead of node2's at one moment,
node2 might catch up with or surpass it a few seconds later.

If the walreceiver is no longer running, we can assume the write LSN has
reached its final value. So by waiting for the walreceiver to exit on both
nodes, we can "safely" compare their write LSNs to decide which one is ahead.
Also, in this situation, since XLogWalRcvFlush() is called during WalRcvDie(),
the flush LSN is effectively guaranteed to match the write LSN, so it seems
safe to use the flush LSN as well.
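To make the selection rule concrete, here is a minimal sketch of the logic described above. It is a toy model, not PostgreSQL code: the helper name, node names, and LSN values are all hypothetical. In practice the liveness check would come from something like the pg_stat_wal_receiver view and the LSNs from pg_last_wal_receive_lsn() (or the proposed write-LSN function):

```python
# Sketch: pick the failover target only after all walreceivers have exited,
# so each node's write LSN is final and safe to compare.
# (Hypothetical helper; values are illustrative, not from the thread.)

def pick_new_primary(nodes):
    """nodes: dict of name -> (walreceiver_running, write_lsn)."""
    # Unsafe to compare while any walreceiver may still advance its LSN.
    if any(running for running, _ in nodes.values()):
        raise RuntimeError("walreceiver still running; write LSNs not final")
    # All write LSNs are final: the node with the highest one wins.
    return max(nodes, key=lambda name: nodes[name][1])

nodes = {
    "node1": (False, 0x1_0000_0300),  # walreceiver exited, write LSN final
    "node2": (False, 0x1_0000_0500),
}
print(pick_new_primary(nodes))  # node2: higher final write LSN
```

The point of the guard is exactly the "wait for the walreceiver to exit" step: comparing while either receiver is still running reintroduces the unknown-random-time problem.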


> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact
> returns the flushed LSN, not the written one. I propose to add a new function
> which returns the LSN actually written. The internals of this function are
> already implemented (GetWalRcvWriteRecPtr()) but unused.

>> GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
>> when the walreceiver restarts. Is this behavior OK for your purpose?

> It is OK, because:
> 1. It's strictly no worse than the flushed LSN

Could you clarify this?

XLogWalRcvFlush() only updates flushedUpto if LogstreamResult.Flush has
advanced, while XLogWalRcvWrite() updates writtenUpto unconditionally. That
means the flush LSN (as reported by pg_last_wal_receive_lsn()) never moves
backward, whereas the write LSN might. Because of this difference in behavior,
I was thinking that we might need to track the maximum write LSN seen so far
and have the new function return that value.
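To illustrate the difference in update policies, here is a toy model (not the actual walreceiver code): the flush position only moves forward, the write position is overwritten unconditionally, and a hypothetical running maximum, as suggested above, restores monotonicity for the write LSN:

```python
# Toy model of the two update policies discussed above.
# flushedUpto only advances; writtenUpto is overwritten unconditionally,
# so it can move backward when the walreceiver restarts at an older position.

class WalRcvModel:
    def __init__(self):
        self.flushed_upto = 0
        self.written_upto = 0
        self.max_written = 0   # hypothetical extra state for the proposed fix

    def write(self, lsn):
        self.written_upto = lsn          # unconditional, like XLogWalRcvWrite()
        self.max_written = max(self.max_written, lsn)

    def flush(self, lsn):
        if lsn > self.flushed_upto:      # conditional, like XLogWalRcvFlush()
            self.flushed_upto = lsn

rcv = WalRcvModel()
rcv.write(500)
rcv.flush(500)
rcv.write(300)                # restart: write position jumps backward
print(rcv.written_upto)       # 300 -- moved backward
print(rcv.flushed_upto)       # 500 -- never regresses
print(rcv.max_written)        # 500 -- running max keeps the write LSN monotonic
```

Under this model, a new SQL-callable function returning max_written rather than written_upto would give callers the same never-moves-backward guarantee that pg_last_wal_receive_lsn() already provides for the flush LSN.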

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION


