Hi Jim, On Sun, Mar 29, 2026 at 10:56 AM Jim Jones <[email protected]> wrote:
> Hi, > > When connecting with target_session_attrs=standby (or prefer-standby, > read-only, any) and multiple standbys are available, libpq currently > selects the first acceptable candidate without regard for how "current" > its data is. A standby configured with recovery_min_apply_delay, > experiencing slow I/O, or otherwise lagging is treated the same as one > that is fully caught up. > > I would like to propose a new libpq connection parameter, > max_wal_replay_size, that allows clients to skip standby servers whose > WAL replay backlog exceeds a given threshold. > > Example: > > psql "host=host1,host2,host3 port=5111,5222,5333 \ > target_session_attrs=standby max_wal_replay_size=16MB" > > When this parameter is set, libpq executes a small query during > connection establishment to evaluate: > > pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) > > on the standby. If the result exceeds the specified threshold, the > server is skipped and the next host in the list is tried. The check is > skipped entirely when target_session_attrs is set to primary or > read-write, since those modes already exclude standbys. > What if none of them meets the criteria? You fail the connection? Wouldn't it cause an availability issue? > > If pg_last_wal_receive_lsn() is NULL (e.g. no active WAL receiver due to > missing primary_conninfo or a disconnected upstream), the backlog cannot > be determined. In that case, the standby is treated as exceeding the > threshold and is skipped. > When a standby is replaying archiving log, it can still be caught up. This doesn't seem right to me. > > This parameter measures only the apply lag on the standby itself, i.e., > how much already-received WAL remains to be replayed. It does not > attempt to measure how far the standby is behind the primary. In > particular, a standby that is slow to receive WAL but fast to replay it > may report a small backlog here while still being significantly behind. > IMHO, this change appears to not meet the objective of routing connections/queries to the most up-to-date standby. Thanks, Satya
