Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

Chao Li Thu, 12 Mar 2026 19:16:50 -0700

> On Mar 12, 2026, at 23:27, Fujii Masao <[email protected]> wrote:
> 
> On Wed, Mar 11, 2026 at 11:39 AM Shinya Kato <[email protected]> wrote:
>> 
>> On Tue, Mar 10, 2026 at 10:54 AM Fujii Masao <[email protected]> wrote:
>>> Even with your latest patch, if we remove fullyAppliedLastTime, and set
>>> clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
>>> positionsUnchanged,
>>> wouldn't the time for the lag to become NULL be almost the same as
>>> wal_receiver_status_interval?
>>> 
>>> The documentation doesn't clearly specify how long it should take for
>>> the lag to become NULL, so doubling that time might be acceptable.
>>> However, if we can keep it roughly the same without much complexity,
>>> I think that would be preferable.
>>> 
>>> Thought?
>> 
>> Thank you for the suggestion. I tested this by removing
>> fullyAppliedLastTime, but even with synchronous replication, NULL
>> still appears. Here is why:
>> 
>> - Reply 1 (flush notification): positions = X. Lag samples are
>> consumed with real values, so noLagSamples = false. clearLagTimes is
>> not set, and prevPtrs = X is saved.
>> 
>> - Reply 2 (force_reply): positions = X again. Here, noLagSamples =
>> true and positionsUnchanged = true. Since applyPtr == sentPtr,
>> clearLagTimes is set to true, resulting in a NULL value.
>> 
>> Therefore, I believe fullyAppliedLastTime is still necessary to ensure
>> that the previous reply also contained no lag samples.
> 
> Thanks for testing and for the clarification! You're right.
> 
> However, if we apply this change, the time required for the lag information to
> be reset would effectively double. I start wondering if that's really
> acceptable, especially for back branches. Although the docs doesn't clearly
> specify this timing, doubling it could affect systems that monitor
> replication lag, for example. It might still be reasonable to apply
> such a change in master, though.
> 
> On further thought, the root cause seems to be that walreceiver can send
> two consecutive status reply messages with identical WAL locations even
> when wal_receiver_status_interval has not yet elapsed. Addressing that
> behavior directly might resolve the issue you reported. I've attached a PoC
> patch that does this. Thought?
> 
> Regards,
> 
> -- 
> Fujii Masao
> <v4-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch>

I just read v4. The approach looks good to me overall. I have a few comments 
about the naming.

This patch changes the old force reply logic to an applied-location-driven 
reply. Now a reply is sent only if the applied location has advanced. However, 
this applied-location-driven reply is still triggered from WalRcvForceReply(), 
so the function has effectively lost its original “force” semantics. Because of 
that, it might be better to rename WalRcvForceReply() to something like 
WalRcvRequestApplyStatusUpdate().

Then,
```
static void
XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply)
```

replyApply reads like “send an apply reply”, but in reality it indicates that 
the applied location should be checked to decide whether to send the reply. So 
it might be clearer to rename it to something like checkApplyStatus.

Lastly,
```
    sig_atomic_t reply_apply; /* used as a bool */
```

reply_apply sounds like an action of “reply with apply”, but what it actually 
represents is that the startup process requested an applied-location-driven 
reply. If applied location is not advanced, the reply won’t be sent. So a name 
like apply_update_requested might better reflect the meaning.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

Reply via email to