Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

Marco Nenciarini Mon, 16 Mar 2026 14:17:20 -0700

On Thu, Jan 29, 2026 at 11:33 AM Fujii Masao wrote:
> Interestingly, I couldn't reproduce it on v11 using the same test case.
> This makes me wonder whether the issue was introduced in v12 or later.

I did some investigation on this.  The bug actually reproduces on v11
with the same setup (archive-only upstream standby + cascading standby
with restore_command).  I ran the test and got the exact same error:

  FATAL: could not receive data from WAL stream:
    ERROR: requested starting point 0/4000000 is ahead of the WAL
    flush position of this server 0/3FFEED8

The reason Fujii-san might not have seen it on v11 is likely related
to how pg_basebackup works with recovery.conf.  In the PG12+
reproducer, sby1 has no primary_conninfo (just standby.signal), making
it archive-only.  But in PG11, when adapting the test, if sby1 retains
any primary_conninfo from the basebackup setup, it would stream from
the primary and its flush position would stay current, masking the bug.

I bisected further and found that the check causing the rejection was
introduced by commit abfd192b1b5 ("Allow a streaming replication
standby to follow a timeline switch", 2012-12-13, Heikki Linnakangas),
which first appeared in PG 9.3.  That commit added this validation in
StartReplication():

    if (am_cascading_walsender)
        FlushPtr = GetStandbyFlushRecPtr();
    else
        FlushPtr = GetFlushRecPtr();
    if (FlushPtr < cmd->startpoint)
    {
        ereport(ERROR,
                errmsg("requested starting point ... is ahead of
                        the WAL flush position ..."));
    }

Before that commit (PG 9.2), the walsender had no such check and would
just start sending from whatever position was requested, waiting for the
data to become available if needed.  So the bug has existed since PG 9.3,
not since PG 12.
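
To make the comparison concrete, here is a minimal standalone sketch of the
same test applied to the two LSNs from the error message above. It is not
the walsender code: the Lsn typedef and the helpers parse_lsn() and
start_point_rejected() are hypothetical names used only for illustration,
and the only thing taken from the server is the "FlushPtr < startpoint"
comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for a WAL position: a 64-bit byte offset. */
typedef uint64_t Lsn;

/*
 * Parse the textual "hi/lo" hex form (e.g. "0/3FFEED8") into a 64-bit
 * value.  Returns false if the input is malformed or has trailing junk.
 */
static bool
parse_lsn(const char *text, Lsn *out)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    assert(text != NULL && out != NULL);

    /* Exactly two fields must match; a third means trailing garbage. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return false;

    *out = ((Lsn) hi << 32) | lo;
    return true;
}

/*
 * Hypothetical model of the validation: the request is rejected when the
 * upstream's flush position is behind the requested starting point.
 */
static bool
start_point_rejected(Lsn flush_ptr, Lsn start_point)
{
    return flush_ptr < start_point;
}
```

Feeding it "0/3FFEED8" as the flush position and "0/4000000" as the
requested starting point makes start_point_rejected() return true, which
corresponds to the ERROR shown above.  A request for exactly the flush
position is accepted.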

The check itself is correct -- you shouldn't serve WAL you don't have.
The real issue is on the requesting side: the cascading standby asks for
a position it advanced to via archive recovery, which the upstream hasn't
reached yet.
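
The numbers in the v11 error are consistent with the "next segment
boundary" description in the original report.  As a rough sketch, assuming
the default 16MB WAL segment size (the test cluster may have been
configured differently), rounding the upstream's flush position up to the
next segment start gives exactly the requested starting point.  The names
below (Lsn, segment_start, next_segment_start) are hypothetical helpers,
not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for a WAL position: a 64-bit byte offset. */
typedef uint64_t Lsn;

/* Assumption: default WAL segment size of 16MB. */
#define DEFAULT_WAL_SEGMENT_SIZE ((Lsn) 16 * 1024 * 1024)

/* First byte of the segment that contains lsn. */
static Lsn
segment_start(Lsn lsn, Lsn seg_size)
{
    assert(seg_size > 0);
    return lsn - (lsn % seg_size);
}

/* First byte of the segment following the one that contains lsn. */
static Lsn
next_segment_start(Lsn lsn, Lsn seg_size)
{
    return segment_start(lsn, seg_size) + seg_size;
}
```

With lsn = 0x3FFEED8 (that is, 0/3FFEED8), segment_start() yields 0/3000000
and next_segment_start() yields 0/4000000, which is 0x1128 bytes past what
the upstream has flushed.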

Best regards,
Marco

On Thu, Jan 29, 2026 at 12:33 PM Fujii Masao <[email protected]> wrote:

> On Thu, Jan 29, 2026 at 2:03 AM Marco Nenciarini
> <[email protected]> wrote:
> >
> > Hi hackers,
> >
> > I've encountered a bug in PostgreSQL's streaming replication where cascading
> > standbys fail to reconnect after falling back to archive recovery. The issue
> > occurs when the upstream standby uses archive-only recovery.
> >
> > The standby requests streaming from the wrong WAL position (next segment boundary
> > instead of the current position), causing connection failures with this error:
> >
> >     ERROR: requested starting point 0/A000000 is ahead of the WAL flush
> >     position of this server 0/9000000
>
> Thanks for the report!
> I was also able to reproduce this issue on the master branch.
>
> Interestingly, I couldn't reproduce it on v11 using the same test case.
> This makes me wonder whether the issue was introduced in v12 or later.
>
> Do you see the same behavior in your environment?
>
> Regards,
>
> --
> Fujii Masao
>
