When a cascading standby launches a new walsender, it fetches the current recovery timeline:

        /*
         * Use the recovery target timeline ID during recovery
         */
        if (am_cascading_walsender)
                ThisTimeLineID = GetRecoveryTargetTLI();

And GetRecoveryTargetTLI() does this:

        /* RecoveryTargetTLI doesn't change so we need no lock to copy it */
        return XLogCtl->RecoveryTargetTLI;


That comment is not true. RecoveryTargetTLI can change during recovery if you set recovery_target_timeline='latest'. In 'latest' mode, when the (apparent) end of WAL is reached, the archive is scanned for any new timeline history files that may have appeared. If a new timeline is found, RecoveryTargetTLI is updated and recovery continues on the new timeline.
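To illustrate the locking half of the problem, here is a minimal standalone sketch of reading and updating a shared timeline ID under a lock. It is not the actual xlog.c code: SharedRecoveryState is a hypothetical stand-in for the shared control struct, and a pthread mutex stands in for whatever lock the real code would take.

```c
#include <pthread.h>
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical stand-in for the shared-memory control struct. */
typedef struct SharedRecoveryState
{
	pthread_mutex_t lock;	/* assumption: a mutex instead of a spinlock */
	TimeLineID	recovery_target_tli;
} SharedRecoveryState;

static void
shared_state_init(SharedRecoveryState *s, TimeLineID tli)
{
	pthread_mutex_init(&s->lock, NULL);
	s->recovery_target_tli = tli;
}

/* Called by the startup process when a newer timeline is found. */
static void
set_recovery_target_tli(SharedRecoveryState *s, TimeLineID tli)
{
	pthread_mutex_lock(&s->lock);
	s->recovery_target_tli = tli;
	pthread_mutex_unlock(&s->lock);
}

/* Called by other processes; copies the value while holding the lock. */
static TimeLineID
get_recovery_target_tli(SharedRecoveryState *s)
{
	TimeLineID	result;

	pthread_mutex_lock(&s->lock);
	result = s->recovery_target_tli;
	pthread_mutex_unlock(&s->lock);
	return result;
}
```

Note that the lock only guarantees a consistent read. A caller that fetches the value once at startup, as the walsender does above, still ends up holding a stale timeline ID after a later update.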

Aside from the missing locking, I wonder what that does to a cascaded standby. If there is an active walsender running while RecoveryTargetTLI is changed, I think what will happen is that the walsender will continue to stream WAL from the old timeline, but because the startup process is now actually replaying from a different timeline, the walsender will send bogus WAL to the standby.
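One way to picture the scenario, and a possible mitigation, is a send loop that re-reads the current timeline before each chunk and stops as soon as it differs from the timeline it started on. This is only a sketch under assumptions: stream_chunks, tli_reader and ScriptedTli are hypothetical names, not walsender code, and the scripted reader merely simulates the startup process switching timelines partway through.

```c
#include <stdint.h>

typedef uint32_t TimeLineID;

/* Hypothetical callback that returns the current recovery target timeline. */
typedef TimeLineID (*tli_reader) (void *arg);

/* Simulation helper: reports 'before' until switch_at_call reads, then 'after'. */
typedef struct ScriptedTli
{
	TimeLineID	before;
	TimeLineID	after;
	int			switch_at_call;
	int			calls;
} ScriptedTli;

static TimeLineID
scripted_read(void *arg)
{
	ScriptedTli *s = (ScriptedTli *) arg;
	TimeLineID	tli = (s->calls < s->switch_at_call) ? s->before : s->after;

	s->calls++;
	return tli;
}

/*
 * "Send" up to nchunks chunks of WAL from start_tli. Before each chunk,
 * re-check the current timeline and stop if it has changed. Returns the
 * number of chunks sent.
 */
static int
stream_chunks(TimeLineID start_tli, int nchunks, tli_reader read_tli, void *arg)
{
	int			sent = 0;

	for (int i = 0; i < nchunks; i++)
	{
		if (read_tli(arg) != start_tli)
			break;				/* timeline changed: stop streaming */
		sent++;					/* placeholder for actually sending a chunk */
	}
	return sent;
}
```

A check like this narrows the window but does not close it: the timeline can still change between the check and the send, so a chunk from the old timeline could still go out.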

When a standby ends recovery, creates a new timeline, and switches to normal operation, postmaster terminates all walsenders because of the timeline change. But don't we have a race condition there, with similar effect? It might take a while for a walsender to die, and in that window, it might send bogus WAL to the cascaded standby.

- Heikki

